Postmortem Readings

Last reviewed on 4 May 2026.

Analytical readings of publicly disclosed API and infrastructure incidents, framed through an API-design lens. Each piece pairs with the official postmortem from the affected company.

Why this series exists

Public postmortems from the major API and infrastructure providers are some of the most useful documents in the field. They describe specific failures with specific causes, in concrete language, and they say what happened with admirable precision. What they typically don't have room to do is draw out the broader lessons: what design decisions made the failure possible, how the same patterns surface in other systems, and what other API providers should change as a result.

This series fills that gap. Each piece is a reading of a specific public incident: a summary of what the official postmortem says, followed by an analysis of what the incident illustrates about API and system design. The aim is the lesson, not the gossip; we attribute everything we claim about each incident to the public source and don't speculate about details the affected company didn't disclose.

Currently available

Reading Stripe's API Versioning Approach

Not an outage but an architectural choice that rewards the same careful reading: how Stripe handles API versioning, why its approach has aged better than the alternatives, and what other API providers should learn from it.
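
Stripe names API versions by release date and lets an individual request opt into a specific one via the Stripe-Version header. The sketch below is our own illustration of that shape, not code from Stripe's documentation; the version date and key variable are placeholders.

    // A minimal sketch of date-pinned versioning in the Stripe style.
    // The version date and environment variable are illustrative values.
    async function listCharges(): Promise<unknown> {
      const response = await fetch("https://api.stripe.com/v1/charges", {
        headers: {
          Authorization: `Bearer ${process.env.STRIPE_SECRET_KEY}`,
          // Without this header the request runs against the version the
          // account is pinned to; with it, the caller opts this one
          // request into a specific dated version.
          "Stripe-Version": "2023-10-16",
        },
      });
      return response.json();
    }

The detail the piece dwells on is the default: omit the header and you get the version your account was pinned to when it first integrated, so old callers keep old behavior without doing anything.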

Reading the AWS US-EAST-1 December 2021 Outage

A multi-hour outage triggered by automated capacity scaling that interacted badly with the network monitoring system. A case study in control-plane vs data-plane design and the danger of automated remediation that amplifies its own failures.
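
To make the second lesson concrete, here is a generic guardrail against that amplification pattern, sketched by us rather than drawn from AWS's systems: bound how much the automation may change per run, and stand down when its own telemetry is degraded. All names and thresholds are hypothetical.

    // Illustrative sketch only: a remediation planner that refuses to act
    // on a degraded view of the world and caps its blast radius per run.
    interface RemediationAction {
      host: string;
      reason: string;
    }

    const MAX_ACTIONS_PER_RUN = 5;   // hypothetical cap
    const MIN_TELEMETRY_HEALTH = 0.8; // hypothetical confidence floor

    function planRemediation(
      candidates: RemediationAction[],
      telemetryHealth: number, // fraction of monitoring signals reporting in
    ): RemediationAction[] {
      // If monitoring itself is impaired, the automation's picture of the
      // system is suspect; doing nothing and paging a human beats acting
      // on bad data.
      if (telemetryHealth < MIN_TELEMETRY_HEALTH) {
        return [];
      }
      // Bound the blast radius so a false positive cannot cascade into a
      // self-inflicted capacity collapse.
      return candidates.slice(0, MAX_ACTIONS_PER_RUN);
    }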

Reading the Cloudflare November 2023 Control-Plane Outage

A two-day control-plane outage triggered by a power failure at a single data center. A case study in the limits of regional redundancy, the importance of exercising failover paths, and what makes a postmortem credible.
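
On the second lesson, exercising a failover path means running it on a schedule, not assuming it works. The sketch below is a hypothetical drill of our own devising; the endpoints and check names are placeholders, not Cloudflare's systems.

    // Illustrative sketch only: route a small slice of control-plane
    // traffic to the secondary site and verify the workflows operators
    // would actually need during an incident.
    async function runFailoverDrill(): Promise<void> {
      await fetch("https://ops.example.internal/routing/control-plane", {
        method: "PUT",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ target: "secondary", trafficPercent: 5 }),
      });

      const checks = ["login", "config-push", "audit-log-read"];
      for (const check of checks) {
        const res = await fetch(`https://ops.example.internal/healthcheck/${check}`);
        if (!res.ok) {
          throw new Error(`Failover drill failed on ${check}`);
        }
      }
    }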

How we choose what to read

Not every public postmortem makes a good reading. We're looking for ones that satisfy several criteria:

  • The official account is detailed enough to support meaningful analysis. Vague "we experienced a service degradation" statements don't give us anything to work with.
  • The incident or design choice illustrates a pattern that generalizes beyond the specific company. A bug in someone's deploy script is not interesting; a class of failure that recurs across providers is.
  • The lesson isn't already obvious. "Don't have a single point of failure" doesn't need a 2,000-word piece; "here's how a multi-region architecture still failed because of an upstream dependency you didn't think about" does.
  • Enough time has passed that the affected company has had room to respond and the dust has settled. Reading hot takes the day after an incident is mostly noise.

Most pieces in this series will be 2,000–3,000 words. They're paired with the appropriate reference pages on the rest of the site so readers can dig deeper into the principles each incident illustrates.

What's coming

The backlog of incidents and architectural decisions worth reading carefully is long. Likely upcoming pieces:

  • The 2024 CrowdStrike incident as a case study in update-distribution APIs and the absence of staged rollouts.
  • GitHub's pattern of multi-day post-incident analyses as a model for engineering communication.
  • The 2017 Amazon S3 outage and the lessons that have held up for the API community.
  • Twilio's webhook delivery system as a case study in operational APIs at scale.
  • The Snowflake credential exposures of 2024 as a study in shared responsibility and customer authentication.

If there's a public postmortem you'd like to see read carefully, the contact page has the address.

Where to go next

For the patterns and references that the postmortem analyses lean on, see the reference. For the long-form articles that go deeper on specific subtopics, see the rest of the blog. For the curated reading list of foundational pieces in API design, see the canon.