Mehul Soni
Senior AI Research Engineer, Enterprise AI Research,
The Vanguard Group

ABOUT THE SPEAKER:

Mehul is a Senior AI Engineer at Vanguard, specializing in building enterprise-scale LLM and agentic AI systems that bridge applied research and production impact. She brings over five years of industry experience applying AI/ML techniques to solve complex business problems. Her work spans LLM post-training, multi-agent systems, evaluation frameworks, and AI systems engineering, with a strong emphasis on translating cutting-edge research into scalable, production-ready solutions. Mehul is actively engaged in AI and professional communities, contributing to initiatives that promote mentorship and inclusive growth, such as Women in Data Science.

TALK TITLE:

Scaling Production-Grade LLMs: Diagnosing Hidden Bottlenecks in Training and Inference Systems

TRACK:

Technical / Engineering Talks

SUB TOPIC:

Inference Serving & Optimization

ABSTRACT:

While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.

In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.

Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.

WHAT YOU’LL LEARN:

TBA

Who Attends

Attendees
0 +
Data Practitioners
0 %
Researchers/Academics
0 %
Business Leaders
0 %

2023 Event Demographics

Technical practitioners working directly with ML/AI systems
0 %
Currently Working in Industry*
0 %
Attendees Looking for Solutions
0 %
Currently Hiring
0 %
Attendees Actively Job-Searching
0 %

2023 Technical Background

Expert/Researcher
14%
Advanced
37%
Intermediate
28%
Beginner
7%

2023 Attendees & Thought Leadership

Attendees
0 +
Speakers
0 +
Company Sponsors
0 +

Business Leaders: C-Level Executives, Project Managers, and Product Owners will get to explore best practices, methodologies, principles, and practices for achieving ROI.

Engineers, Researchers, Data Practitioners: Will get a better understanding of the challenges, solutions, and ideas being offered via breakouts & workshops on Natural Language Processing, Neural Nets, Reinforcement Learning, Generative Adversarial Networks (GANs), Evolution Strategies, AutoML, and more.

Job Seekers: Will have the opportunity to network virtually and meet over 30+ Top Al Companies.

Ignite what is an Ignite Talk?

Ignite is an innovative and fast-paced style used to deliver a concise presentation.

During an Ignite Talk, presenters discuss their research using 20 image-centric slides which automatically advance every 15 seconds.

The result is a fun and engaging five-minute presentation.

You can see all our speakers and full agenda here

Get our official conference app
For Blackberry or Windows Phone, Click here
For feature details, visit Whova