Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing

Dan Sun, David Goodwin at KubeCon + CloudNativeCon North America 2020

Large-scale language models, such as BERT and GPT-2, have brought exciting leaps in state-of-the-art accuracy for many NLP tasks. BERT requires significant compute during inference, which poses challenges for real-time application performance. KFServing provides a simple model serving interface across common model servers with a standardized REST/gRPC inference protocol to serve single or co-located multiple models on CPU or GPU. KFServing enables hardware acceleration and autoscaling of Bloomberg's own BERT models trained on a corpora of specialized, financial news data. In this talk, we will discuss how we use KFServing in a production application to address scalability, latency, and throughput with Knativeā€™s Autoscaler and Activator. We will also discuss some performance debugging tips and show the GPU benchmark results with TensorFlow/PyTorch BERT models deployed to KFServing.