Abstract: MLOps hurdles don’t end once models are pushed to production. In the ML lifecycle, inference workloads present a critical challenge: throughput and latency become the key measures, and teams struggle to achieve efficient GPU utilization. In this workshop, applicable to both Data Scientists and ML Engineers, Guy Salton will give an overview of the challenges in moving ML prototypes to production and how best-in-class ML teams are successfully overcoming these hurdles.
We’ll discuss using fractional GPU capabilities to improve throughput and reduce latency, and we’ll show how one organization built an inference platform on top of Kubernetes with NVIDIA A100 MIG (Multi-Instance GPU) to support its scaling AI initiatives. Very few organizations are using the new NVIDIA MIG functionality successfully, so even if you’re not using A100s yet, this is a unique opportunity to see how MIG works for inference use cases.
What You’ll Learn:
How to use NVIDIA MPS (Multi-Process Service) with fractional GPUs to increase throughput
Dynamic MIG functionality – how to provision a MIG slice dynamically for each new job when using the NVIDIA A100 GPU
Additional capabilities for running inference on top of Kubernetes
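As a taste of the kind of setup the workshop covers, here is a minimal, illustrative sketch (not from the workshop materials) of how a pod can request a fraction of an A100 via MIG on Kubernetes. It assumes the NVIDIA device plugin is deployed with the "mixed" MIG strategy, which exposes each MIG profile as its own extended resource; the resource name and container image below are assumptions that depend on your cluster's configuration.

```yaml
# Hypothetical pod spec: requests a single 1g.5gb MIG slice of an A100
# instead of a whole GPU. Assumes the NVIDIA Kubernetes device plugin is
# running with MIG strategy "mixed", so each MIG profile appears as a
# separately schedulable extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference-example
spec:
  restartPolicy: Never
  containers:
    - name: inference-server
      image: nvcr.io/nvidia/tritonserver:23.10-py3  # example inference server image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1  # one MIG slice; seven such slices fit on one A100
```

With the "single" strategy, by contrast, all MIG devices would be exposed under the generic `nvidia.com/gpu` resource name, so the choice of strategy shapes how workloads are scheduled onto slices.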
Guy Salton is the Solutions Engineering Lead at Run:AI, specializing in DevOps, cloud computing, Kubernetes, containers, virtualization, CI/CD, and AI computing. He runs POCs and technical projects for Run:AI’s commercial and enterprise customers, including on-site installations and workshops. Guy speaks at conferences and meetups around the world, writes blog posts, and delivers webinars.
(Monday) 10:00 AM - 11:00 AM