A story of the failure of a pumped energy storage facility, involving all of our favorite features like complex contributing factors, work-as-done vs work-as-designed, and early warning signs only obvious in hindsight. As a bonus, no one was killed.
Practical Engineering
Nebula’s streaming service has a surprisingly write-heavy workload, because it stores a bookmark of the latest point each user has watched in a video. That makes scaling an interesting challenge.
Sam Rose — Nebula
I love the debugging technique they used: kill processes one at a time until performance improves.
Samson Hu, Shashank Tavildar, Eric Kalkanger, and Hunter Gatewood — Pinterest
This article is about finding the balance between enough process to keep incident response running smoothly and so much process that responders cannot adapt to unexpected situations.
Brandon Chalk — Rootly
This article presents two case studies of dialog during incidents along with analysis of each. How does your own analysis compare?
Hamed Silatani — Uptime Labs
They realized that a single alert can’t catch both a sudden AC failure and an AC that becomes slowly but steadily overwhelmed.
Chris Siebenmann
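To make the two failure modes concrete, here is a minimal sketch of one way to read the idea: pair a rate-of-rise check (for the sudden failure) with an absolute ceiling (for the slow overwhelm). The thresholds, the function name `check_alerts`, and the alert names are all hypothetical and are not taken from the article.

```python
# Illustrative only: every threshold and name here is an assumption,
# not a value or rule from the linked article.
ABSOLUTE_LIMIT_C = 30.0      # hypothetical "room is too hot" ceiling
RISE_LIMIT_C_PER_MIN = 0.5   # hypothetical "climbing too fast" rate
RISE_WINDOW_MIN = 10         # how far back the rate check looks


def check_alerts(samples):
    """Return the alerts firing at the most recent sample.

    samples: list of (minute, temp_c) tuples ordered by time.
    """
    alerts = []
    if not samples:
        return alerts

    now, temp = samples[-1]

    # Slow overwhelm: the temperature creeps up and never trips a rate
    # check, but it eventually crosses an absolute ceiling.
    if temp >= ABSOLUTE_LIMIT_C:
        alerts.append("too_hot")

    # Sudden failure: the room may still be cool in absolute terms, but
    # the temperature is rising quickly over the recent window.
    window = [(t, c) for t, c in samples if now - t <= RISE_WINDOW_MIN]
    start_time, start_temp = window[0]
    if now > start_time:
        rate = (temp - start_temp) / (now - start_time)
        if rate >= RISE_LIMIT_C_PER_MIN:
            alerts.append("rising_fast")

    return alerts


# A sudden failure: still well under the ceiling, but climbing 1 °C/min.
sudden = [(0, 22.0), (1, 23.0), (2, 24.0), (3, 25.0), (4, 26.0), (5, 27.0)]

# A slow overwhelm: about 1 °C per hour, eventually over the ceiling.
slow = [(0, 29.0), (30, 29.5), (60, 30.0), (62, 30.1)]

print(check_alerts(sudden))
print(check_alerts(slow))
```

Neither check alone covers both traces: the absolute ceiling stays quiet during the early minutes of the sudden failure, and the rate check never fires on the slow climb.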
Thoughts on migrations as a significant source of reliability risk.
[…] engineering organizations at tech companies need to make migrations a part of their core competency, rather than seeing them as one-off chores.
Lorin Hochstein
An incorrect physical disconnection was made to the active network switch serving our control plane, rather than the redundant unit scheduled for removal.
This reminds me of wrong-side surgery incidents and aircraft pilots shutting off the good engine when one fails.