Purchase this theme on shadcnblocks.com
← BlogBerlin
Dispatch № 007February MMXXVI

Burning the runbook.

We replaced two hundred pages with one paragraph — and the paragraph wrote itself.

Lede

The runbook was two hundred and fourteen pages long. The current version is a paragraph. We did not lose anything we needed.

The binder

The runbook had grown the way runbooks grow — one tragedy at a time. After every incident a paragraph was appended that explained, in apologetic detail, what should have happened. After enough quarters, the runbook was a kind of folk history of the team's mistakes. It was thorough. It was also, by any honest reading, unread.

The audit

We instrumented the runbook the way we had instrumented the dashboards. We logged which sections were opened during incidents, how long they were read, and whether the engineer reading them changed their behavior as a result. After ninety days, the verdict was unambiguous:

  • Six pages had been opened during an incident
  • Two of those had been read for longer than ten seconds
  • One had changed an engineer's decision

Every other page was historical comfort — useful when the runbook was being written, useless when the system was on fire.

The paragraph

We deleted the binder. In its place we kept a paragraph that the system writes for each incident in real time. It contains four things, in this order:

The service that is on fire. The decision the system has already made. The thing a human can do that the system cannot. The page of the binder that we know we are violating, if any.

The paragraph is not a runbook. It is a note from the system to the human, written at the moment the human is needed. The runbook had tried to anticipate every situation; the paragraph admits that nothing useful can be said about a situation until the situation has begun.

The weeks after

For three weeks the paragraph was met with suspicion. Engineers would, mid-incident, ask in chat where the runbook had gone. We would point them at the paragraph. They would read it, look around for the rest, and slowly realize that the rest had been the problem.

By the fourth week the paragraph was being trusted. By the eighth, an engineer admitted, with some embarrassment, that the only thing she missed about the binder was the smell. We have not reprinted it.