LISA18 Session notes – Failure at Netflix velocity

One of the big themes at LISA18 was the fast-emerging discipline of Site Reliability Engineering (SRE). This approach to resiliency was born out of hyper-scale consumer cloud services like Google, Facebook and Netflix, but the principles are starting to resonate across the industry.

Failure at Netflix velocity, by Dave Hahn, Netflix

Netflix is big, really big. Their infrastructure is on the order of 100K+ EC2 instances, with 1000s of changes every day, streaming 10TB/s of video, with one goal – “winning moments of truth with their customers”. Essentially, Netflix believes it is competing for a split-second decision on a user’s entertainment choice, against playing a game, reading a book, watching TV, and so on.

When Netflix decided to move to AWS, they rearchitected their entire system to take advantage of the cloud and to normalize failure by randomly killing instances (their “Chaos Monkey” approach – fairly well known in the industry, and a key tenet of the emerging “Chaos Engineering” discipline). This improved resiliency to instance failures, but it left another problem: ‘grey failures’ – essentially intermittent issues, incomplete requests, or slowdowns. To address these, they built another tool called “Latency Monkey” that allows their SRE team to introduce synthetic latency into the environment.
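To make the two ideas concrete, here is a minimal sketch of my own, not Netflix's actual tooling. It assumes a fleet is just a list of instance IDs and a "service call" is any Python function; the names `pick_victim` and `with_synthetic_latency` are hypothetical. The first function picks a random instance to kill, and the second wraps a call so that a fraction of requests is artificially slowed down.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def pick_victim(instances: List[str], rng: random.Random) -> str:
    """Choose one instance at random to terminate (Chaos Monkey-style).

    Hypothetical helper: a real tool would then call the cloud
    provider's API to actually terminate the chosen instance.
    """
    if not instances:
        raise ValueError("no instances to choose from")
    return rng.choice(instances)


def with_synthetic_latency(
    func: Callable[..., T],
    delay_seconds: float,
    probability: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., T]:
    """Wrap func so that a fraction of calls is delayed (Latency Monkey-style).

    `sleep` is injectable so the wrapper can be exercised without
    really waiting.
    """

    def wrapper(*args, **kwargs):
        # Delay only the configured share of calls, then run the real call.
        if rng.random() < probability:
            sleep(delay_seconds)
        return func(*args, **kwargs)

    return wrapper
```

For example, `with_synthetic_latency(fetch_title, 0.5, 0.1, random.Random())` would slow roughly one in ten calls to a (hypothetical) `fetch_title` function by half a second, which is the kind of grey failure that instance-killing alone never exercises.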

More broadly, Netflix has key principles for ensuring resiliency:

  • Reasonable Prevention – don’t overindex on past failures: by the time a major failure happens, an alignment of 198 dominoes has all fallen at just the right time to cause the issue. At the same time, don’t overindex on perceived future failures – it blinds the team to the easy opportunities right in front of them
  • Invest in resilience – Resiliency must be a conscious choice and should be codified by good application patterns
  • Introduce chaos – consider SREs like a “security red team, but for reliability” – this helps remind the team of the importance of resiliency, without waiting for a compelling event
  • Expect failures – failure is a when, not an if. Improving recovery is a more important strategy than prevention, because it is hard to predict all scenarios

And principles on incident management:

  • SHORT – incidents should be as short and shallow as possible. It is obvious, but needs to be said because it needs to be measured
  • UNIQUE – Every outage should be new and exciting. If incidents are repetitive and boring, it means the team isn’t learning from past issues and taking steps to solve them
  • VALUABLE – Outages are expensive, so the team must extract all the value they can.
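As a rough illustration of how the first two principles could be measured, here is a small sketch of my own. It assumes each incident record carries a start time, a resolution time and a cause label assigned during the review; the `Incident` type and both function names are hypothetical and not something shown in the talk.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime
    resolved: datetime
    cause: str  # hypothetical label assigned in the incident review


def mean_duration(incidents: List[Incident]) -> timedelta:
    """SHORT: average time from start to resolution."""
    if not incidents:
        raise ValueError("no incidents recorded")
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


def repeated_causes(incidents: List[Incident]) -> List[str]:
    """UNIQUE: causes that show up more than once, sorted by name."""
    counts = Counter(i.cause for i in incidents)
    return sorted(cause for cause, n in counts.items() if n > 1)
```

A non-empty result from `repeated_causes` is the "repetitive and boring" signal: the same lesson is being paid for more than once.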

The role of SRE at Netflix

While DevOps seeks to institutionalize good ops feedback loops, there is still a need for “domain experts in failure”.

  • Before incidents, SREs focus on training teams to function on-call, ensuring teams know the key metrics and know to rely on the broader team
  • During incidents they are responsible for coordination and communications
  • After incidents, their job is to help memorialize the issues, seek input from all stakeholders and capture user impact stories, so that they understand the chain of events that led to the incident and can conduct truly blameless incident reviews

One interesting philosophical angle I took out of this presentation: issues should be memorable and treated like folklore, fairy tales and fables, so that the cautionary tales are remembered by the organization. Incidents aren’t problems to be solved; they are dilemmas that need to be constantly mitigated and incrementally improved through process.