Add alerting chapter

Charles-Axel Dein 2022-11-20 21:38:07 -05:00
parent 2004ffdfe2
commit 9b7abbfb51

@@ -829,6 +829,7 @@ Practice:
- Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
- The further up your serving stack you go, the more distinct problems you catch in a single rule. But don't go so far you can't sufficiently distinguish what's going on.
- If you want a quiet oncall rotation, it's imperative to have a system for dealing with things that need timely response, but are not imminently critical.
- This classic article has now become a [chapter](https://sre.google/sre-book/monitoring-distributed-systems/) in Google's SRE book.
- The Google SRE book's [chapter about oncall](https://landing.google.com/sre/workbook/chapters/on-call/)
- [Writing Runbook Documentation When You're An SRE](https://www.transposit.com/blog/2020.01.30-writing-runbook-documentation-when-youre-an-sre/)
- Playbooks “reduce stress, the mean time to repair (MTTR), and the risk of human error.”