Abstract:
As datasets and model sizes grow, training takes longer and longer, but scaling such systems doesn't need to be scary. Imagine if training jobs that previously took days instead took hours, and for roughly the same cost! Distributed training enables this by splitting your model training across many instances in a cluster. Off-the-shelf tooling has improved over the last one to two years to the point where the process itself is easy, and most of the challenge lies in making the right design decisions up front. It's also important to understand how the research problem changes when scaling up, e.g., the need to tune hyperparameters such as learning rate and batch size.
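One widely cited starting point for re-tuning when the global batch size grows is the linear scaling rule: scale the learning rate by the same factor as the batch size, then fine-tune from there. A minimal sketch (the helper name and values are illustrative, not from the talk):

```python
def scale_learning_rate(base_lr, base_batch_size, global_batch_size):
    """First-guess learning rate for a larger global batch (linear scaling rule).

    This is a heuristic starting point, not a guarantee; the result
    usually still needs empirical tuning, often with a warmup schedule.
    """
    return base_lr * (global_batch_size / base_batch_size)

# Example: moving from 1 GPU at batch 32 to 8 GPUs (global batch 256)
print(scale_learning_rate(0.1, 32, 256))  # → 0.8
```

In practice this rule breaks down at very large batch sizes, which is one of the "research problem changes" the abstract alludes to.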
What You’ll Learn:
You'll learn how to make design decisions early in model development that make it easy to switch to training at scale. Tools and platforms such as PyTorch Lightning, Hydra, and Azure ML streamline this process. Since the research problem changes somewhat on a cluster versus a single node, we'll discuss how to think about the associated trade-offs in network latency, hyperparameters, and compute type. We'll also cover basic performance profiling to understand bottlenecks.
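The profiling mentioned above can start very simply: time each phase of a training step and see where the seconds go. A minimal standard-library sketch (the phase names and sleeps are stand-ins for real data-loading and compute code, not part of the talk):

```python
import time

def profile_step(phases):
    """Time each named phase (a zero-arg callable) and return seconds spent."""
    timings = {}
    for name, fn in phases:
        start = time.perf_counter()
        fn()
        timings[name] = time.perf_counter() - start
    return timings

# Toy example: an I/O-bound step where data loading dominates
timings = profile_step([
    ("data_loading", lambda: time.sleep(0.05)),      # stand-in for I/O
    ("forward_backward", lambda: time.sleep(0.01)),  # stand-in for compute
])
print(max(timings, key=timings.get))  # → data_loading
```

On a cluster, the same per-phase breakdown helps distinguish network-bound steps (gradient synchronization) from compute- or I/O-bound ones; dedicated profilers give finer detail, but this is the core idea.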
Doug is currently building an end-to-end ML platform for computer vision edge devices at Axon, maker of the TASER stun gun. Doug was formerly the CTO at Passenger AI (YC S18), which was acquired by Zippin Inc. There, he led a team of engineers developing self-driving car interior monitoring solutions that were sold to Fortune 20 companies. Doug lived in the San Francisco Bay Area for several years, where he worked as a software engineer at companies like Mozilla and Zynga. He holds an MSc in Computer Science with a Machine Learning specialization from Georgia Tech and a BASc in Mechatronics Engineering from the University of Waterloo.
(Thursday) 12:10 PM - 12:40 PM