How the OOM-Killer Deleted My Namespace, and Other Kubernetes Tales

Laurent Bernaille and Tabitha Sable at KubeCon + CloudNativeCon North America 2020

Running Kubernetes at scale is challenging, and you can end up having to debug complex, unexpected issues. Doing so requires a detailed understanding of how the different components work and interact with each other. Over the last three years, Datadog has migrated most of its workloads to Kubernetes and now manages dozens of clusters, each consisting of thousands of nodes. Along the way, engineers have debugged complex issues whose root causes were sometimes very surprising. In this talk, Laurent and Tabitha will share some of these stories, including a favorite: how a complex interaction between familiar Kubernetes components allowed an OOM-killer invocation to trigger the deletion of a namespace.