The PagerDuty Post-Mortem Handbook
Post-Mortems Are Necessary
No major incident is ever truly resolved without a post-mortem. Post-mortems are a great way for development teams to identify and analyze which elements of a project succeeded and which did not. A post-mortem looks back and reviews the incident in detail to determine exactly what went wrong, why it went wrong, and what can be done in the future to make sure it doesn’t happen again.
Sharing Our Incident Response Process
Reliability has always been one of the primary design considerations at PagerDuty. But what do we do when the unexpected happens and something does go wrong? It’s of the utmost importance that we are prepared and can get our systems back into full working order as quickly as possible. We pride ourselves on being able to quickly resolve issues that arise and keep our systems working within their SLA. We’ve worked very hard to accomplish this, and our incident response process is where it all begins.
Our internal incident response documentation is something we’ve built up over the last few years as we’ve learned from our mistakes. It details the best practices of our process, from how to prepare new employees for on-call responsibilities to how to handle major incidents, covering both the preparation beforehand and the follow-up work afterward. Few companies seem to talk about their internal processes for dealing with major incidents. It’s sometimes considered taboo to even mention the word “incident” in any sort of communication. We would like to change that.
To that end, we’d like to share how we here at PagerDuty conduct post-mortems internally. It is our hope that others will use this documentation as a starting point to formalize their own processes. This guide explains what to do after a major incident and shares PagerDuty’s follow-up and after-action review procedures.