Scaling to Millions of ML Models to Solve the Problems of SRE and Security

Jakub Pavlik, Sandeep Pombra at KubeCon + CloudNativeCon North America 2020

This talk describes how to scale to millions of ML models operating on petabytes of operational and user data. The models improve the efficacy of SRE teams and the security of end users’ application services: they support a zero trust security framework and infrastructure diagnosis, using machine learning, anomaly detection and time series analysis. A production deployment that delivers this many models combines many open source technologies, such as Kubernetes, Prometheus, Cortex, Apache Spark and Apache Arrow. We will describe the key challenges we had to solve when implementing machine learning and anomaly detection on Kubernetes (K8s) nodes and an Envoy-based service mesh. These challenges include collecting data from hundreds of thousands of nodes, the high cardinality of models, and distributing the inference models down to each of the K8s nodes.
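
To make the "high cardinality of models" challenge concrete, here is a minimal sketch of one way a per-series model could look. It is an illustration only, not the speakers' implementation: it assumes one tiny model per (node, metric) pair, each keeping a running mean and variance (Welford's algorithm) and flagging values whose z-score exceeds a threshold. The names `RunningStats`, `ModelRegistry`, `observe` and the threshold of 3.0 are all hypothetical.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical per-series model: running mean/variance (Welford's algorithm)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x: float) -> float:
        # With fewer than two samples there is no variance estimate yet.
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0.0 else (x - self.mean) / std


class ModelRegistry:
    """Holds one model per (node, metric) key; the key count is the model cardinality."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.threshold = threshold  # assumed cutoff, not taken from the talk
        self.models = defaultdict(RunningStats)

    def observe(self, node: str, metric: str, value: float) -> bool:
        """Score the value against the series' history, then learn from it."""
        model = self.models[(node, metric)]
        is_anomaly = abs(model.zscore(value)) > self.threshold
        model.update(value)
        return is_anomaly


if __name__ == "__main__":
    registry = ModelRegistry()
    for v in [10, 11, 9, 10, 11, 9, 10, 100]:
        print(v, registry.observe("node-1", "cpu", v))
```

Because every (node, metric) pair gets its own state, the number of models grows as the product of nodes and metrics, which is how a fleet of hundreds of thousands of nodes can reach millions of models even when each one is this small.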