From 9b7abbfb51f8d34ecffe0c9ab0cfa628a3c15f61 Mon Sep 17 00:00:00 2001
From: Charles-Axel Dein <120501+charlax@users.noreply.github.com>
Date: Sun, 20 Nov 2022 21:38:07 -0500
Subject: [PATCH] Add alerting chapter

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index e3ba067..d16d5de 100644
--- a/README.md
+++ b/README.md
@@ -829,6 +829,7 @@ Practice:
     - Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
     - The further up your serving stack you go, the more distinct problems you catch in a single rule. But don’t go so far you can’t sufficiently distinguish what’s going on.
     - If you want a quiet oncall rotation, it’s imperative to have a system for dealing with things that need timely response, but are not imminently critical.
+    - This classical article has now become a [chapter](https://sre.google/sre-book/monitoring-distributed-systems/) in Google's SRE book.
 - The Google SRE book's [chapter about oncall](https://landing.google.com/sre/workbook/chapters/on-call/)
 - [Writing Runbook Documentation When You’re An SRE](https://www.transposit.com/blog/2020.01.30-writing-runbook-documentation-when-youre-an-sre/)
     - Playbooks “reduce stress, the mean time to repair (MTTR), and the risk of human error.”