Postmortem Readings

Last reviewed on 4 May 2026.

Analytical readings of publicly disclosed API and infrastructure incidents, framed through an API-design lens. Each piece pairs with the official postmortem from the affected company.

Why this series exists

Public postmortems from the major API and infrastructure providers are some of the most useful documents in the field. They describe specific failures with specific causes, in concrete language, and they say what happened with admirable precision. What they typically don't have room to do is draw out the broader lessons: what design decisions made the failure possible, how the same patterns surface in other systems, and what other API providers should change as a result.

This series fills that gap. Each piece is a reading of a specific public incident: a summary of what the official postmortem says, followed by an analysis of what the incident illustrates about API and system design. The aim is the lesson, not the gossip; we attribute everything we claim about each incident to the public source and don't speculate about details the affected company didn't disclose.

Currently available

Reading Stripe's API Versioning Approach

Not an outage but an architectural choice that rewards the same careful reading: how Stripe handles API versioning, why its approach has aged better than the alternatives, and what other API providers should learn from it.
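
Stripe names API versions by release date and lets an individual request opt into a specific one via the Stripe-Version header. The sketch below is our own illustration of that shape, not code from Stripe's documentation; the version date and key variable are placeholders.

    // A minimal sketch of date-pinned versioning in the Stripe style.
    // The version date and environment variable are illustrative values.
    async function listCharges(): Promise<unknown> {
      const response = await fetch("https://api.stripe.com/v1/charges", {
        headers: {
          Authorization: `Bearer ${process.env.STRIPE_SECRET_KEY}`,
          // Without this header the request runs against the version the
          // account is pinned to; with it, the caller opts this one
          // request into a specific dated version.
          "Stripe-Version": "2023-10-16",
        },
      });
      return response.json();
    }

The detail the piece dwells on is the default: omit the header and you get the version your account was pinned to when it first integrated, so old callers keep old behavior without doing anything.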

Reading the AWS US-EAST-1 December 2021 Outage

A multi-hour outage triggered by automated capacity scaling that interacted badly with the network monitoring system. A case study in control-plane vs data-plane design and the danger of automated remediation that amplifies its own failures.
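
To make the second lesson concrete, here is a generic guardrail against that amplification pattern, sketched by us rather than drawn from AWS's systems: bound how much the automation may change per run, and stand down when its own telemetry is degraded. All names and thresholds are hypothetical.

    // Illustrative sketch only: a remediation planner that refuses to act
    // on a degraded view of the world and caps its blast radius per run.
    interface RemediationAction {
      host: string;
      reason: string;
    }

    const MAX_ACTIONS_PER_RUN = 5;   // hypothetical cap
    const MIN_TELEMETRY_HEALTH = 0.8; // hypothetical confidence floor

    function planRemediation(
      candidates: RemediationAction[],
      telemetryHealth: number, // fraction of monitoring signals reporting in
    ): RemediationAction[] {
      // If monitoring itself is impaired, the automation's picture of the
      // system is suspect; doing nothing and paging a human beats acting
      // on bad data.
      if (telemetryHealth < MIN_TELEMETRY_HEALTH) {
        return [];
      }
      // Bound the blast radius so a false positive cannot cascade into a
      // self-inflicted capacity collapse.
      return candidates.slice(0, MAX_ACTIONS_PER_RUN);
    }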

Reading the Cloudflare November 2023 Control-Plane Outage

A two-day control-plane outage triggered by a power failure at a single data center. A case study in the limits of regional redundancy, the importance of exercising failover paths, and what makes a postmortem credible.
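
On the second lesson, exercising a failover path means running it on a schedule, not assuming it works. The sketch below is a hypothetical drill of our own devising; the endpoints and check names are placeholders, not Cloudflare's systems.

    // Illustrative sketch only: route a small slice of control-plane
    // traffic to the secondary site and verify the workflows operators
    // would actually need during an incident.
    async function runFailoverDrill(): Promise<void> {
      await fetch("https://ops.example.internal/routing/control-plane", {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ target: "secondary", trafficPercent: 5 }),
      });

      const checks = ["login", "config-push", "audit-log-read"];
      for (const check of checks) {
        const res = await fetch(`https://ops.example.internal/healthcheck/${check}`);
        if (!res.ok) {
          throw new Error(`Failover drill failed on ${check}`);
        }
      }
    }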

How we choose what to read

Not every public postmortem makes a good reading. We're looking for ones that satisfy several criteria:

  • The official account is detailed enough to support meaningful analysis. Vague "we experienced a service degradation" statements don't give us anything to work with.
  • The incident or design choice illustrates a pattern that generalizes beyond the specific company. A bug in someone's deploy script is not interesting; a class of failure that recurs across providers is.
  • The lesson isn't already obvious. "Don't have a single point of failure" doesn't need a 2,000-word piece; "here's how a multi-region architecture still failed because of an upstream dependency you didn't think about" does.
  • Enough time has passed that the affected company has had room to respond and the dust has settled. Reading hot takes the day after an incident is mostly noise.

Most pieces in this series will be 2,000–3,000 words. They're paired with the appropriate reference pages on the rest of the site so readers can dig deeper into the principles each incident illustrates.

What's coming

The backlog of incidents and architectural decisions worth reading carefully is long. Likely upcoming pieces:

  • The 2024 CrowdStrike incident as a case study in update-distribution APIs and the absence of staged rollouts.
  • GitHub's pattern of multi-day post-incident analyses as a model for engineering communication.
  • The 2017 Amazon S3 outage and the lessons that have held up for the API community.
  • Twilio's webhook delivery system as a case study in operational APIs at scale.
  • The Snowflake credential exposures of 2024 as a study in shared responsibility and customer authentication.

If there's a public postmortem you'd like to see read carefully, the contact page has the address.

Where to go next

For the patterns and references that the postmortem analyses lean on, see the reference. For the long-form articles that go deeper on specific subtopics, see the rest of the blog. For the curated reading list of foundational pieces in API design, see the canon.