Anne Baretta found a nice video describing the October 2018 GitHub outage. Here’s the TL&DW:
- The outage was caused by a brief (~1 minute) disconnect of the primary data center.
- The database replicas failed over to the secondary data center, but that failover had never been tested, and of course some stuff didn’t work.
- In the meantime, batch jobs changed data in the primary data center, leaving the two replicas out of sync.
- It took them over 24 hours to clean up the mess.
You REALLY SHOULD watch the video – it nicely proves two points I’ve been making for ages (not that anyone would listen):