From ca19dd1db482a876e552b04a0dc889c4acbd12d7 Mon Sep 17 00:00:00 2001
From: Charles-Axel Dein <ca@d3in.org>
Date: Fri, 15 May 2020 18:23:37 +0200
Subject: [PATCH] Create postmortem section

---
 README.md | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index ef5b265..02a569b 100644
--- a/README.md
+++ b/README.md
@@ -423,10 +423,9 @@ The best way to learn is to learn by doing.
 * [The elevator programming game](http://play.elevatorsaga.com/)
 * [Challenging projects every programmer should try](http://web.eecs.utk.edu/~azh/blog/challengingprojects.html): text editor, space invaders, compiler (Tiny Basic), mini OS, spreadsheet, video game console emulator.
 
-### Incident response (oncall, alerting, outages, firefighting)
+### Incident response (oncall, alerting, outages, firefighting, postmortem)
 
 * [Incident Response at Heroku](https://blog.heroku.com/archives/2014/5/9/incident-response-at-heroku)
-* [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/)
 * [My Philosophy On Alerting](https://linuxczar.net/sysadmin/philosophy-on-alerting/)
   * Pages should be urgent, important, actionable, and real.
   * Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
@@ -434,7 +433,6 @@ The best way to learn is to learn by doing.
   * Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
   * The further up your serving stack you go, the more distinct problems you catch in a single rule. But don’t go so far you can’t sufficiently distinguish what’s going on.
   * If you want a quiet oncall rotation, it’s imperative to have a system for dealing with things that need timely response, but are not imminently critical.
-* A great example of a [postmortem from Gitlab (01/31/2017)](https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/) for an outage during which an engineer's action caused the irremediable loss of 6 hours of data.
 * The Google SRE book's [chapter about oncall](https://landing.google.com/sre/workbook/chapters/on-call/)
 * [Writing Runbook Documentation When You’re An SRE](https://www.transposit.com/blog/2020.01.30-writing-runbook-documentation-when-youre-an-sre/)
   * Playbooks “reduce stress, the mean time to repair (MTTR), and the risk of human error.”
@@ -443,6 +441,15 @@ The best way to learn is to learn by doing.
   * Make your content easy to glance over.
   * If a script is longer than a single line, treat it like code, and check it into a repository to be source control and potentially tested.
 
+#### Postmortem
+
+* A great example of a [postmortem from GitLab (01/31/2017)](https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/) for an outage during which an engineer's action caused the irremediable loss of 6 hours of data.
+* [Blameless PostMortems and a Just Culture](https://codeascraft.com/2012/05/22/blameless-postmortems/)
+* [A list of postmortems on GitHub](https://github.com/danluu/post-mortems)
+* The Google SRE book's [postmortem chapter](https://landing.google.com/sre/workbook/chapters/postmortem-culture/) is excellent and includes many examples.
+
+Dan Milstein: “Let’s plan for a future where we’re all as stupid as we are today.”
+
 ### Internet
 
 * [How Does the Internet Work?](https://web.stanford.edu/class/msande91si/www-spr04/readings/week1/InternetWhitepaper.htm)