Analyzing Operational Data at Scale Using ML at Intuit

Amit Kalamkar, Vigith Maurice at KubeCon + CloudNativeCon North America 2020

Intuit runs ~2000 services in preprod and prod on Kubernetes and needs a fast, easy way to detect and isolate problems through automated analysis of operational data. In this talk, we will present how we collect, analyze, and present real-time operational data using an Operational Data Lake built on stream processing, efficient data structures, and caching. It is a warehouse for clean, documented, and schematized operational data drawn from cloud infrastructure, Kubernetes, and application events and metrics. In particular, we will show how we use simple ML techniques to automatically detect anomalies in impacted services and to identify the causal service. We will also demo how this helps us quickly detect and identify production issues.
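
The abstract does not say which ML techniques are used. As one illustration of what a "simple" anomaly detector over a streaming metric might look like, the sketch below flags points that deviate sharply from a trailing window (a rolling z-score). The function name, window size, threshold, and sample latency values are all assumptions made for this example, not details from the talk.

```python
from collections import deque
from math import sqrt


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of points far from the mean of the preceding window.

    Hypothetical helper for illustration only; not Intuit's implementation.
    """
    history = deque(maxlen=window)  # trailing window of the most recent values
    anomalies = []
    for i, x in enumerate(values):
        if len(history) == window:
            mean = sum(history) / window
            variance = sum((v - mean) ** 2 for v in history) / window
            std = sqrt(variance)
            # Skip flat windows (std == 0) to avoid dividing by zero.
            if std > 0 and abs(x - mean) / std > threshold:
                anomalies.append(i)
        history.append(x)
    return anomalies


# Made-up per-minute latency samples (ms) for a single service.
latency_ms = [100, 102, 98, 101, 99, 100, 250, 101, 100]
print(rolling_zscore_anomalies(latency_ms))
```

A trailing window suits stream processing because each point is scored using only data already seen. One trade-off in this sketch is that a spike enters the window after it is scored, which inflates the standard deviation and can mask smaller anomalies for the next few points.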