diff --git a/README.md b/README.md
index 2ff4509..4bc5008 100644
--- a/README.md
+++ b/README.md
@@ -742,6 +742,21 @@ Rob Pike, [Go at Google: Language Design in the Service of Software Engineering]
   - Wood's theorem: As the complexity of a system increases, the accuracy of any single agent’s own model of that system decreases rapidly.
   - The more tools and code that you add to create elements in a system, the harder it is to replicate an environment encompassing those tools and code.
   - At the core of testing in production is the idea of splitting deployments (of artifacts) from releases (of features).
+- [Testing in Production: the hard parts](https://medium.com/@copyconstruct/testing-in-production-the-hard-parts-3f06cefaf592), Cindy Sridharan
+  - The whole point of [actual] distributed systems engineering is you assume you’re going to fail at some point in time and you design the system in such a way that the damage, at each point is minimized, that recovery is quick, and that the risk is acceptably balanced with cost.
+  - How can you cut the blast radius for a similar event in half?
+  - Differentiate between deployment (0 risk) and release
+  - Build a deploy-observe-release pipeline
+  - Make incremental rollouts the norm (canaries, %-based rollouts, etc.)
+  - Test configuration changes just like you test code
+  - Default to roll back, avoid fixing forward (slow!)
+  - Eliminate gray failures - prefer crashing to degrading in certain cases
+  - Prefer loosely coupled services at the expense of latency or correctness
+  - Use poison tasters (isolate handling of client input)
+  - Implement per-request-class backpressure
+  - Have proper visibility from a client/end-user standpoint (client-side metrics)
+
+
 
 
 ### Security