Last week, I had the pleasure of visiting Nashville, Tennessee, for the Usenix LISA18 conference. It's always refreshing to take a break from the Microsoft conference circuit and spend some time learning from the wider industry. I try to travel light at conferences like this, relying on pen and paper – but over the next few days I will type up my notes and put them on this blog… as an open reminder of key topics.
SLO Burn—Reducing Alert Fatigue and Maintenance Cost in Systems of Any Size by James Wilkinson, Google.
My takeaways:
The key point of the session is that the typical approach to operations alerting and oncall processes is broken. We should focus instead on "Service Level Objectives" (SLOs) and rules to determine when service issues are important enough to justify interrupting normal operations.
The cost of maintenance must scale sublinearly with the growth of a system. Diagnostics and monitoring tools can become their own form of technical debt – often there is no time to maintain the monitoring tools, because the ops team is too busy maintaining the system itself. When a system is constantly evolving, alerts get out of date, cause false positives, and add yet more noise to the system.
This isn’t to say tooling isn’t important – but observability is an emergent property of the system; it isn’t static, and it isn’t a problem that can be solved once (it’s a dilemma that needs to be mitigated and institutionalized in process). There are key tools to build out observability:
- Logs – preformatted events
- Metrics – aggregated events
- Traces – events in a tree, essentially related by session ID
- Exceptions/Stack traces – these are ‘extinction level’ events that mean major things have gone wrong.
Time spent in improving observability is better invested in improving debug tools, rather than alerting tools.
Instead of alerting on logs, it is better to set the bar for how much failure is acceptable (the ‘error budget’) before an oncall ‘page’ is triggered, defined via SLIs, SLOs and SLAs. For the SRE noobs, here is a quick refresher:
- SLI (Service Level Indicator) – this is the measure
- SLO (Service Level Objective) – this is the goal for what the SLI should be
- SLA (Service Level Agreement) – this is the agreement/incentive to meet the SLO
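To make the error-budget idea concrete, here is a minimal sketch of turning an SLO into a budget of allowed failures (the SLO target and request volume are illustrative numbers, not from the talk):

```python
# Sketch: deriving an error budget from a request-based SLO.
# Assumed numbers for illustration only.
slo = 0.999                   # objective: 99.9% of requests succeed
monthly_requests = 10_000_000 # hypothetical monthly traffic

# The error budget is the failure the SLO permits over the window.
error_budget = (1 - slo) * monthly_requests
print(round(error_budget))  # 10000 failed requests allowed this month
```

Once the budget is spent, the service is out of SLO for the month – which is the signal the talk argues should drive paging decisions, rather than individual log lines.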
There are different ways of setting objectives – for instance, a traditional uptime metric is uptime / total time. For other services like APIs, functions etc, a cloud availability metric may look more like successful requests / total requests. Outside-in probes are not always great – they are good at showing when something goes wrong, but it is harder to use them to quantify successful requests. Google’s approach is to measure SLOs at the load balancer level – so they can track the number of requests coming in.
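A request-based availability SLI like the one described above is simple to compute; a minimal sketch (function and sample counts are my own, not from the talk):

```python
# Sketch: a request-based availability SLI, as might be measured
# from load balancer request counters.
def availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded; treat no traffic as 100%."""
    if total_requests == 0:
        return 1.0
    return successful_requests / total_requests

# Example: 500 failures out of 1M requests -> 99.95% available.
print(availability(999_500, 1_000_000))  # 0.9995
```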
SLOs are a negotiation – and will vary by service and by team. Often, SLOs and SLAs are monthly, so it makes sense to measure the “burn rate” of the error budget – that is, whether the budget is being used up faster than the SLO window allows (eg, 70 errors a second). A ‘Fast Burn’ alert can then be set on that rate. Setting the alert level is also a negotiation – make it too sensitive and there will be too many alerts and too much interruption to normal operations; make it too insensitive and real issues will be missed, so support tickets will go up.
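The burn-rate idea can be sketched as the ratio of the observed error rate to the rate the SLO allows – a value of 1.0 means the budget lands exactly at zero at the end of the window. The function and the threshold of 14x below are my own illustration, not Google's actual alerting values:

```python
# Sketch: error-budget burn rate over a recent window.
# Threshold and sample numbers are assumptions for illustration.
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """Ratio of observed error rate to the error rate the SLO allows.
    1.0 = burning exactly at budget; >1.0 = budget exhausts early."""
    if requests == 0:
        return 0.0
    observed_error_rate = errors / requests
    allowed_error_rate = 1 - slo
    return observed_error_rate / allowed_error_rate

# A 'fast burn' page might fire when the last hour burns ~14x budget:
rate = burn_rate(errors=1_500, requests=100_000, slo=0.999)
print(rate >= 14)  # True -> page the oncall
```

The attraction of alerting on burn rate rather than raw error counts is that the threshold is expressed in terms of the negotiated SLO, so it stays meaningful as the service's traffic grows.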
Finally, there is a philosophical point about being oncall – oncall is not a role requirement for SRE/ops teams, but it is a tool for improving the product.
While the session wasn’t recorded, James delivered a similar session at SRECon Asia/Australia earlier in the year.