Whatever Can Go Wrong, Will Go Wrong – Rook/Ceph and Storage Failures

Sagy Volkov at KubeCon + CloudNativeCon North America 2020

Imagine you are running a 200-node Kubernetes cluster and you suddenly lose a node, or even a ToR switch. What is the state of the persistent storage your application relies on? How can you make sure your storage is always available? How can you measure and plan for how long it takes your storage to return to 100% resiliency? In this presentation we'll go over the basics of storage demands (RPO/RTO), how the different types of replication in Ceph affect recovery time, and how the failure of a component such as a drive, a node, or a cluster determines how long we are at risk. We'll include a live demo of a Rook/Ceph recovery process from a failed component. We'll show which Rook components are recreated, how Ceph behaves while components and pods are recreated, and what the impact is on the application while these failures occur (in our case, the application will be MariaDB).

https://sched.co/ekFH