Reorganize reliability

2024-03-08 09:57:22 -05:00
commit b1340c4474
parent 321af44dc2
1 changed file with 104 additions and 81 deletions
--- a/README.md
+++ b/README.md
@@ -80,6 +80,7 @@
    - [Observability (monitoring, logging, exception handling)](#observability-monitoring-logging-exception-handling)
      - [Logging](#logging)
      - [Error/exception handling](#errorexception-handling)
+      - [Metrics](#metrics)
      - [Monitoring](#monitoring)
    - [Open source](#open-source)
    - [Operating system (OS)](#operating-system-os)
@@ -106,15 +107,17 @@
      - [Checklists](#checklists)
      - [Feature flags](#feature-flags)
      - [Testing in production](#testing-in-production)
+    - [Reliability](#reliability)
+      - [Resiliency](#resiliency)
    - [Search](#search)
    - [Security](#security)
    - [Shell (command line)](#shell-command-line)
    - [SQL](#sql)
    - [System administration](#system-administration)
    - [System architecture](#system-architecture)
-      - [Scalability](#scalability)
-      - [Reliability](#reliability)
-      - [Resiliency](#resiliency)
+      - [Architecture patterns](#architecture-patterns)
+      - [Microservices/splitting a monolith](#microservicessplitting-a-monolith)
+    - [Scalability](#scalability)
    - [Site Reliability Engineering (SRE)](#site-reliability-engineering-sre)
    - [Technical debt](#technical-debt)
    - [Testing](#testing)
@@ -1029,14 +1032,6 @@ Practice:
 - [Incident Response at Heroku](https://blog.heroku.com/archives/2014/5/9/incident-response-at-heroku)
  - Described the Incident Commander role, inspired by natural disaster incident response.
  - Also in presentation: [Incident Response Patterns: What we have learned at PagerDuty - Speaker Deck](https://speakerdeck.com/arupchak/incident-response-patterns-what-we-have-learned-at-pagerduty)
- [My Philosophy On Alerting](https://linuxczar.net/sysadmin/philosophy-on-alerting/)
-  - Pages should be urgent, important, actionable, and real.
-  - Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
-  - Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.
-  - Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
-  - The further up your serving stack you go, the more distinct problems you catch in a single rule. But don’t go so far you can’t sufficiently distinguish what’s going on.
-  - If you want a quiet oncall rotation, it’s imperative to have a system for dealing with things that need timely response, but are not imminently critical.
-  - This classical article has now become a [chapter](https://sre.google/sre-book/monitoring-distributed-systems/) in Google's SRE book.
 - The Google SRE book's [chapter about oncall](https://landing.google.com/sre/workbook/chapters/on-call/)
 - [Writing Runbook Documentation When You’re An SRE](https://www.transposit.com/blog/2020.01.30-writing-runbook-documentation-when-youre-an-sre/)
  - Playbooks “reduce stress, the mean time to repair (MTTR), and the risk of human error.”
@@ -1049,6 +1044,18 @@ Practice:
 - [Incident Management Resources](https://resources.sei.cmu.edu/library/asset-view.cfm?assetID=505044), Carnegie Mellon University
 - [Sterile flight deck rule](https://en.wikipedia.org/wiki/Sterile_flight_deck_rule), Wikipedia

+Alerting:
+
+- [My Philosophy On Alerting](https://linuxczar.net/sysadmin/philosophy-on-alerting/)
+  - Pages should be urgent, important, actionable, and real.
+  - Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.
+  - Symptoms are a better way to capture more problems more comprehensively and robustly with less effort.
+  - Include cause-based information in symptom-based pages or on dashboards, but avoid alerting directly on causes.
+  - The further up your serving stack you go, the more distinct problems you catch in a single rule. But don’t go so far you can’t sufficiently distinguish what’s going on.
+  - If you want a quiet oncall rotation, it’s imperative to have a system for dealing with things that need timely response, but are not imminently critical.
+  - This classical article has now become a [chapter](https://sre.google/sre-book/monitoring-distributed-systems/) in Google's SRE book.
+- 🏙 [The Paradox of Alerts](https://speakerdeck.com/charity/the-paradox-of-alerts): why deleting 90% of your paging alerts can make your systems better, and how to craft an on-call rotation that engineers are happy to join.
+
 #### Postmortem

 - A great example of a [postmortem from Gitlab (01/31/2017)](https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/) for an outage during which an engineer's action caused the irremediable loss of 6 hours of data.
@@ -1301,6 +1308,8 @@ Richard Feynman's Learning Strategy:

 ### Observability (monitoring, logging, exception handling)

+*See also: [Site Reliability Engineering (SRE)](#site-reliability-engineering-sre)*
+
 #### Logging

 - [Do not log](https://sobolevn.me/2020/03/do-not-log) dwells on some logging antipatterns.
@@ -1324,6 +1333,13 @@ Richard Feynman's Learning Strategy:
  - Explain the solution
  - Write clearly

+#### Metrics
+
+- [Meaningful availability](https://blog.acolyer.org/2020/02/26/meaningful-availability/)
+  - A good availability metric should be meaningful, proportional, and actionable. By "meaningful" we mean that it should capture what users experience. By "proportional" we mean that a change in the metric should be proportional to the change in user-perceived availability. By "actionable" we mean that the metric should give system owners insight into why availability for a period was low. This paper shows that none of the commonly used metrics satisfy these requirements…
+- 📃 [Meaningful Availability](https://www.usenix.org/conference/nsdi20/presentation/hauer) paper.
+  - This paper presents and evaluates a novel availability metric: windowed user-uptime
+
 #### Monitoring

 - Google, [Site Reliability Engineering, Monitoring Distributed Systems](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/)
@@ -1602,6 +1618,65 @@ JavaScript is such a pervasive language that it's almost required learning.
    - Have proper visibility from a client/end-user standpoint (client-side metrics)
 - [Testing in Production, the safe way](https://medium.com/@copyconstruct/testing-in-production-the-safe-way-18ca102d0ef1)

+### Reliability
+
+*See also: [System architecture](#system-architecture)*
+
+Books:
+
+- 📖 [Site Reliability Engineering](https://landing.google.com/sre/books/)
+  - Written by members of Google's SRE team, with a comprehensive analysis of the entire software lifecycle - how to build, deploy, monitor, and maintain large scale systems.
+
+Quotes:
+
+> Quality is a snapshot at the start of life and reliability is a motion picture of the day-by-day operation.
+> – [NIST](https://www.itl.nist.gov/div898/handbook/apr/section1/apr111.htm)
+
+> Reliability is the one feature every customer uses. -- An auth0 SRE.
+
+Articles:
+
+- I already mentioned the book Release it! above. There's also a [presentation](http://www.slideshare.net/justindorfman/stability-patterns-presentation) from the author.
+- [Service Recovery: Rolling Back vs. Forward Fixing](https://www.linkedin.com/pulse/service-recovery-rolling-back-vs-forward-fixing-mohamed-el-geish/)
+- [How Complex Systems Fail](https://how.complexsystems.fail/)
+    - Catastrophe requires multiple failures – single point failures are not enough.
+    - Complex systems contain changing mixtures of failures latent within them.
+    - Post-accident attribution to a ‘root cause’ is fundamentally wrong.
+    - Hindsight biases post-accident assessments of human performance.
+    - Safety is a characteristic of systems and not of their components
+    - Failure free operations require experience with failure.
+- [Systems that defy detailed understanding](https://blog.nelhage.com/post/systems-that-defy-understanding/)
+    - Focus effort on systems-level failure, instead of the individual component failure.
+    - Invest in sophisticated observability tools, aiming to increase the number of questions we can ask without deploying custom code
+- [Operating a Large, Distributed System in a Reliable Way: Practices I Learned](https://blog.pragmaticengineer.com/operating-a-high-scale-distributed-system/), Gergely Orosz.
+  - A good summary of processes to implement.
+- [Production Oriented Development](https://paulosman.me/2019/12/30/production-oriented-development.html)
+  - Code in production is the only code that matters
+  - Engineers are the subject matter experts for the code they write and should be responsible for operating it in production.
+  - Buy Almost Always Beats Build
+  - Make Deploys Easy
+  - Trust the People Closest to the Knives
+  - QA Gates Make Quality Worse
+  - Boring Technology is Great.
+  - Non-Production Environments Have Diminishing Returns
+  - Things Will Always Break
+- 🏙 [High Reliability Infrastructure migrations](https://speakerdeck.com/jvns/high-reliability-infrastructure-migrations), Julia Evans.
+- [Appendix F: Personal Observations on the Reliability of the Shuttle](https://www.refsmmat.com/files/reflections.pdf), Richard Feynman
+
+Resources:
+
+- 🧰 [dastergon/awesome-sre](https://github.com/dastergon/awesome-sre)
+- 🧰 [upgundecha/howtheysre](https://github.com/upgundecha/howtheysre): a curated collection of publicly available resources on SRE at technology and tech-savvy organizations
+
+#### Resiliency
+
+- 🏙 [The Walking Dead - A Survival Guide to Resilient Applications](https://speakerdeck.com/daschl/the-walking-dead-a-survival-guide-to-resilient-applications)
+- 🏙 [Defensive Programming & Resilient systems in Real World (TM)](https://speakerdeck.com/tuenti/defensive-programming-and-resilient-systems-in-real-world-tm)
+- 🏙 [Full Stack Fest: Architectural Patterns of Resilient Distributed Systems](https://speakerdeck.com/randommood/full-stack-fest-architectural-patterns-of-resilient-distributed-systems)
+- 🏙 [The 7 quests of resilient software design](https://www.slideshare.net/ufried/the-7-quests-of-resilient-software-design)
+- 🧰 [Resilience engineering papers](https://github.com/lorin/resilience-engineering): comprehensive list of resources on resilience engineering
+- [MTTR is more important than MTBF (for most types of F)](https://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/) (also as a [presentation](https://www.slideshare.net/jallspaw/dev-and-ops-collaboration-and-awareness-at-etsy-and-flickr))
+
 ### Search

 - [What every software engineer should know about search](https://scribe.rip/p/what-every-software-engineer-should-know-about-search-27d1df99f80d)
@@ -1686,6 +1761,8 @@ List of resources:

 ### System architecture

+*See also: [Reliability](#reliability), [Scalability](#scalability)*
+
 Reading lists:

 - 🧰 [donnemartin/system-design-primer](https://github.com/donnemartin/system-design-primer): learn how to design large scale systems. Prep for the system design interview.
@@ -1707,15 +1784,11 @@ Books:
 Articles:

 - [6 Rules of thumb to build blazing fast web server applications](http://loige.co/6-rules-of-thumb-to-build-blazing-fast-web-applications/)
- [Service oriented architecture: scaling the Uber engineering codebase as we grow](https://eng.uber.com/soa/)
+
 - [The twelve-factor app](http://12factor.net/)
 - [Introduction to architecting systems for scale](http://lethain.com/introduction-to-architecting-systems-for-scale/)
 - [The Log: What every software engineer should know about real-time data's unifying abstraction](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying): one of those classical articles that everyone should read.
 - [Turning the database inside-out with Apache Samza](https://www.confluent.io/blog/turning-the-database-inside-out-with-apache-samza/)
- [Scaling to 100k Users](https://alexpareto.com/scalability/systems/2020/02/03/scaling-100k.html), Alex Pareto. The basics of getting from 1 to 100k users.
- [Systems that defy detailed understanding](https://blog.nelhage.com/post/systems-that-defy-understanding/)
-    - Focus effort on systems-level failure, instead of the individual component failure.
-    - Invest in sophisticated observability tools, aiming to increase the number of questions we can ask without deploying custom code
 - [Fallacies of distributed computing](https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing), Wikipedia
 - [The biggest thing Amazon got right: the platform](https://gigaom.com/2011/10/12/419-the-biggest-thing-amazon-got-right-the-platform/)
    - All teams will henceforth expose their data and functionality through service interfaces.
@@ -1725,12 +1798,18 @@ Articles:
 - [Building Services at Airbnb, part 4](https://medium.com/airbnb-engineering/building-services-at-airbnb-part-4-23c95e428064)
  - Building Schema Based Testing Infrastructure for service development
 - [Patterns of Distributed Systems](https://martinfowler.com/articles/patterns-of-distributed-systems/), MartinFowler.com
- [ConwaysLaw](https://martinfowler.com/bliki/ConwaysLaw.html), MartinFowler.com (regarding organization, check out my [engineering-management](https://github.com/charlax/engineering-management/) list.
+- [ConwaysLaw](https://martinfowler.com/bliki/ConwaysLaw.html), MartinFowler.com (regarding organization, check out my [engineering-management](https://github.com/charlax/engineering-management/) list).
 - [The C4 model for visualising software architecture](https://c4model.com/)
 - [If Architects had to work like Programmers](http://www.gksoft.com/a/fun/architects.html)

-Microservices/splitting a monolith:
+#### Architecture patterns

+- BFF (backend for frontend)
+  - [Backends For Frontends](https://samnewman.io/patterns/architectural/bff/)
+
+#### Microservices/splitting a monolith
+
+- [Service oriented architecture: scaling the Uber engineering codebase as we grow](https://eng.uber.com/soa/)
 - [Don’t start with microservices in production – monoliths are your friend](https://arnoldgalovics.com/microservices-in-production/)
 - [Deep lessons from Google And EBay on building ecosystems of microservices](http://highscalability.com/blog/2015/12/1/deep-lessons-from-google-and-ebay-on-building-ecosystems-of.html)
 - [Introducing domain-oriented microservice architecture](https://eng.uber.com/microservice-architecture/), Uber
@@ -1745,74 +1824,17 @@ Microservices/splitting a monolith:
 - [Death by a thousand microservices](https://renegadeotter.com/2023/09/10/death-by-a-thousand-microservices.html)
  - [Microservices](https://www.youtube.com/watch?v=y8OnoxKotPQ&ab_channel=KRAZAM)

-#### Scalability
+### Scalability
+
+*See also: [Reliability](#reliability), [System architecture](#system-architecture)*

 - [Scalable web architecture and distributed systems](http://www.aosabook.org/en/distsys.html)
 - 📖 [Scalability Rules: 50 Principles for Scaling Web Sites](https://smile.amazon.com/Scalability-Rules-Principles-Scaling-Sites/dp/013443160X) ([presentation](http://www.slideshare.net/cyrilwang/scalability-rules))
-
-#### Reliability
-
-> Quality is a snapshot at the start of life and reliability is a motion picture of the day-by-day operation.
-> – [NIST](https://www.itl.nist.gov/div898/handbook/apr/section1/apr111.htm)
-
- I already mentioned the book Release it! above. There's also a [presentation](http://www.slideshare.net/justindorfman/stability-patterns-presentation) from the author.
- [Service Recovery: Rolling Back vs. Forward Fixing](https://www.linkedin.com/pulse/service-recovery-rolling-back-vs-forward-fixing-mohamed-el-geish/)
- [How Complex Systems Fail](https://how.complexsystems.fail/)
-    - Catastrophe requires multiple failures – single point failures are not enough.
-    - Complex systems contain changing mixtures of failures latent within them.
-    - Post-accident attribution to a ‘root cause’ is fundamentally wrong.
-    - Hindsight biases post-accident assessments of human performance.
-    - Safety is a characteristic of systems and not of their components
-    - Failure free operations require experience with failure.
- 🧰 [Testing Distributed Systems](https://asatarin.github.io/testing-distributed-systems/)
-
-#### Resiliency
-
- 🏙 [The Walking Dead - A Survival Guide to Resilient Applications](https://speakerdeck.com/daschl/the-walking-dead-a-survival-guide-to-resilient-applications)
- 🏙 [Defensive Programming & Resilient systems in Real World (TM)](https://speakerdeck.com/tuenti/defensive-programming-and-resilient-systems-in-real-world-tm)
- 🏙 [Full Stack Fest: Architectural Patterns of Resilient Distributed Systems](https://speakerdeck.com/randommood/full-stack-fest-architectural-patterns-of-resilient-distributed-systems)
- 🏙 [The 7 quests of resilient software design](https://www.slideshare.net/ufried/the-7-quests-of-resilient-software-design)
- 🧰 [Resilience engineering papers](https://github.com/lorin/resilience-engineering): comprehensive list of resources on resilience engineering
- [MTTR is more important than MTBF (for most types of F)](https://www.kitchensoap.com/2010/11/07/mttr-mtbf-for-most-types-of-f/) (also as a [presentation](https://www.slideshare.net/jallspaw/dev-and-ops-collaboration-and-awareness-at-etsy-and-flickr))
+- [Scaling to 100k Users](https://alexpareto.com/scalability/systems/2020/02/03/scaling-100k.html), Alex Pareto. The basics of getting from 1 to 100k users.

 ### Site Reliability Engineering (SRE)

-*Note: this section is only about SRE as a role. Checkout the System Architecture for more content related to reliability.*
-
-Books:
-
- 📖 [Site Reliability Engineering](https://landing.google.com/sre/books/)
-  - Written by members of Google's SRE team, with a comprehensive analysis of the entire software lifecycle - how to build, deploy, monitor, and maintain large scale systems.
-
-Articles:
-
- [Graduating from Bootcamp and interested in becoming a Site Reliability Engineer?](https://medium.com/@tammybutow/graduating-from-bootcamp-and-interested-in-becoming-a-site-reliability-engineer-b69a38ce858b): a great collection of resources to learn about SRE.
- [Operating a Large, Distributed System in a Reliable Way: Practices I Learned](https://blog.pragmaticengineer.com/operating-a-high-scale-distributed-system/), Gergely Orosz.
-  - A good summary of processes to implement.
- [Production Oriented Development](https://paulosman.me/2019/12/30/production-oriented-development.html)
-  - Code in production is the only code that matters
-  - Engineers are the subject matter experts for the code they write and should be responsible for operating it in production.
-  - Buy Almost Always Beats Build
-  - Make Deploys Easy
-  - Trust the People Closest to the Knives
-  - QA Gates Make Quality Worse
-  - Boring Technology is Great.
-  - Non-Production Environments Have Diminishing Returns
-  - Things Will Always Break
- [Meaningful availability](https://blog.acolyer.org/2020/02/26/meaningful-availability/)
-  - A good availability metric should be meaningful, proportional, and actionable. By "meaningful" we mean that it should capture what users experience. By "proportional" we mean that a change in the metric should be proportional to the change in user-perceived availability. By "actionable" we mean that the metric should give system owners insight into why availability for a period was low. This paper shows that none of the commonly used metrics satisfy these requirements…
- 📃 [Meaningful Availability](https://www.usenix.org/conference/nsdi20/presentation/hauer) paper.
-  - This paper presents and evaluates a novel availability metric: windowed user-uptime
- 🏙 [High Reliability Infrastructure migrations](https://speakerdeck.com/jvns/high-reliability-infrastructure-migrations), Julia Evans.
- 🏙 [The Paradox of Alerts](https://speakerdeck.com/charity/the-paradox-of-alerts): why deleting 90% of your paging alerts can make your systems better, and how to craft an on-call rotation that engineers are happy to join.
- [Appendix F: Personal Observations on the Reliability of the Shuttle](https://www.refsmmat.com/files/reflections.pdf), Richard Feynman
-
-> Reliability is the one feature every customer users. -- An auth0 SRE.
-
-Resources:
-
- 🧰 [dastergon/awesome-sre](https://github.com/dastergon/awesome-sre)
- [upgundecha/howtheysre](https://github.com/upgundecha/howtheysre): a curated collection of publicly available resources on SRE at technology and tech-savvy organizations
+*See: [Reliability](#reliability)*

 ### Technical debt

@@ -1828,6 +1850,7 @@ Resources:
 ### Testing

 - ⭐️ [Testing strategies in a microservices architecture](http://martinfowler.com/articles/microservice-testing/) (Martin Fowler) is an awesome resource explaining how to test a service properly.
+- 🧰 [Testing Distributed Systems](https://asatarin.github.io/testing-distributed-systems/)

 Why test:

@@ -2025,7 +2048,7 @@ Blogs:
 - [OOP](https://en.wikipedia.org/wiki/Object-oriented_programming)
 - [SOLID](<https://en.wikipedia.org/wiki/SOLID_(object-oriented_design)>)
 - [TDD](https://en.wikipedia.org/wiki/Test-driven_development)
-= [Two Generals' Problem](https://en.wikipedia.org/wiki/Two_Generals%27_Problem)
+- [Two Generals' Problem](https://en.wikipedia.org/wiki/Two_Generals%27_Problem)
 - [YAGNI](https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it)

 ## My other lists