An SLO-Driven Approach to Enhance Kubernetes Cluster Reliability

Cong Chen, Qian Ding at KubeCon + CloudNativeCon North America 2020

How do you define the reliability of a Kubernetes cluster? What are the SLOs? How many 9s are enough to keep end users happy on a Kubernetes cluster with thousands of nodes? Service-level objectives (SLOs) are the key to running large-scale production clusters reliably. Defining SLOs for classic web services is simple, since web requests are served synchronously and return distinct status codes. In contrast, defining SLOs for Kubernetes services is less clear-cut because of Kubernetes' intent-oriented design and declarative APIs. This talk first briefly introduces the philosophy behind the SLO-driven approach to reliability engineering, followed by a deep dive into how SREs define SLOs for one of the world's largest Kubernetes clusters, at Ant Financial. Finally, the talk shares concrete cases and lessons learned from building an SLO framework, from several perspectives including monitoring, alerting, and tracing.