Researchers, practitioners, and industry leaders sharing practical lessons from real AI and ML work.
One virtual day, two in-person days of keynotes and technical sessions, and one dedicated workshop day.
TMLS brings together voices from industry and research to share real-world lessons in machine learning, AI infrastructure, enterprise adoption, and applied AI. From keynote sessions to technical talks, the program is built for people looking to learn from work that is grounded in practice.
Jump to:
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research. She is the Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she served as faculty in the Computer Science Department and then as Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon; an M.A. in Computer Science from Boston University; and a Ph.D. in Computer Science from Carnegie Mellon University. She holds Doctorate Honoris Causa degrees from Örebro University, Sweden; the Instituto Universitário de Lisboa (ISCTE), Portugal; the Université de Bordeaux, France; and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of the main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
I will talk about AI agents, and multiagent systems in particular. I will focus on agents’ perception as the robust processing and sharing of information, their cognition as planning and memory-based reasoning, and their action as the capability to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI, with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI agents have limitations; they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto, where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. He is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health, as well as an Affiliate Scientist at ICES and a Faculty Affiliate of the Vector Institute.

In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award from the Canadian College of Health Leaders, and he was named international AI Leader of the Year by AIMed. Previously, he was named among Canada’s Top 40 Under 40. He has published over 600 studies in peer-reviewed medical journals.

Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, he obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree at Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare, yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real-world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E. in 1998 from Queen’s University, and his M.S. in 2002 and Ph.D. in 2007, both in Aeronautics and Astronautics from Stanford University. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Significant scaffolding around the agent is needed to make spatial intelligence possible; there remains a big gap between mainstream LLM/MLLM uses and robotics, with much left to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. In practice, however, this intuitive approach leads to dramatically worse retrieval performance: modern embedding models are typically built on BERT-style architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters in JSON syntax.

In this session, I will break down the exact failures of embedding structured data—from tokenization and attention-mechanism disruption to the mathematical liability of mean pooling over syntax tokens. I will then demonstrate a practical, production-ready solution: a simple preprocessing step that converts structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
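The flattening step described in the abstract can be sketched in a few lines. The record fields and the sentence template below are hypothetical illustrations, not the exact implementation behind the ESCI results:

```python
import json

def flatten_to_text(record: dict) -> str:
    """Render a flat JSON record as a natural-language sentence.

    Hypothetical template: "The <field> is <value>, the <field> is <value>."
    """
    parts = [f"{key.replace('_', ' ')} is {value}" for key, value in record.items()]
    return "The " + ", the ".join(parts) + "."

raw = '{"title": "Trail Running Shoe", "brand": "Acme", "color": "blue", "size_us": 10}'
record = json.loads(raw)

# Raw JSON is dense with braces, quotes, and colons that a BERT-style
# tokenizer fragments into syntax tokens; the flattened string is plain prose.
print(flatten_to_text(record))
# → The title is Trail Running Shoe, the brand is Acme, the color is blue, the size us is 10.
```

The flattened string, rather than the raw JSON, is what gets passed to the embedding model, so the tokenizer sees ordinary prose instead of syntax characters.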
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into a generic vector space is suboptimal, and that adding a simple preprocessing step to flatten structured data consistently delivers significant improvements in retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that careful data preparation is essential for achieving peak performance of a semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
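As a rough illustration of the attack-surface mapping exercise, the three layers named in the abstract can be enumerated as data and an eval suite checked against them. The specific entries below are hypothetical examples, not the session's taxonomy:

```python
# Each trust boundary of an agent pipeline, mapped to an attack class it
# exposes. Entries are illustrative examples only.
ATTACK_SURFACE = {
    "orchestration logic": ["goal hijacking via injected instructions"],
    "tool boundaries": ["tool misuse: unvalidated arguments reaching an API"],
    "downstream trust chains": ["output trust exploitation by consumers"],
}

def coverage(evaluated: set[str]) -> float:
    """Fraction of surface categories an eval suite actually probes."""
    return len(evaluated & ATTACK_SURFACE.keys()) / len(ATTACK_SURFACE)

# A benchmark that only probes prompt injection at the orchestration layer
# leaves two thirds of the surface untested.
print(f"{coverage({'orchestration logic'}):.2f}")
# → 0.33
```

Even this trivial bookkeeping makes the gap between capability benchmarks and adversarial coverage explicit before deployment.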
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist on the Sponsored Products Search team at Walmart, which is responsible for powering the advertising technology for Walmart’s e-commerce platform. My work spans semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product development, I work on applied research; I recently had a paper accepted to the SIGIR 2026 Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences such as gluten-free, plant-based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that may be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. For advertisers, this means their products miss high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential.

Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent-aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline in which a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations into a lightweight student LLM through LoRA-based supervised fine-tuning (LoRA-SFT), producing a model that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service: the distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
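The intent-augmentation step can be sketched as follows, assuming a simple tag-style serialization. The attribute names and the [INTENT] separator are illustrative assumptions, not Walmart's production schema:

```python
# Sketch: predicted structured intent attributes are appended to the raw
# query text before it is passed to the bi-encoder's query tower.
def augment_query(query: str, intents: dict[str, str]) -> str:
    # Serialize non-empty intent attributes as "attribute: value" tags.
    tags = " ; ".join(f"{k}: {v}" for k, v in intents.items() if v)
    return f"{query} [INTENT] {tags}" if tags else query

predicted = {
    "brand": "schar",
    "dietary_preference": "gluten-free",  # implicit: inferred from the brand
    "product_subtype": "white bread",
}
print(augment_query("schar white bread", predicted))
# → schar white bread [INTENT] brand: schar ; dietary_preference: gluten-free ; product_subtype: white bread
```

Product representations would be augmented symmetrically, so both towers of the bi-encoder see the same structured vocabulary.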
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
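As a minimal illustration of the code-based scorers the session covers, here is a sketch pairing a primary metric with a guardrail. The metric names and thresholds are hypothetical, not a prescribed schema:

```python
# A deterministic, code-based scorer: checks which tools the agent used
# against an allow-list (primary metric) and enforces a length guardrail.
def score_response(response: str, tools_used: list[str],
                   allowed_tools: set[str], max_chars: int = 2000) -> dict:
    tool_accuracy = (
        sum(t in allowed_tools for t in tools_used) / len(tools_used)
        if tools_used else 1.0
    )
    return {
        "tool_accuracy": tool_accuracy,               # primary metric
        "within_length": len(response) <= max_chars,  # guardrail
    }

result = score_response("Refund issued.", ["lookup_order", "issue_refund"],
                        allowed_tools={"lookup_order", "issue_refund"})
print(result)
# → {'tool_accuracy': 1.0, 'within_length': True}
```

Scorers like this run cheaply on every production log, leaving LLM-as-a-judge and human review for the criteria code cannot express.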
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Rajiv Shah is an Agentic AI Engineer at OpenHands with a passion for, and expertise in, practical AI. He focuses on enabling enterprise teams to succeed with AI. Rajiv has worked on GTM teams at leading AI companies, including Hugging Face in open-source AI, Contextual AI in context engineering, Snorkel in data-centric AI, Snowflake in cloud computing, and DataRobot in AutoML. He started his career in data science at State Farm and Caterpillar.

Rajiv is a widely recognized speaker on AI who has published over 20 research papers, been cited over 1,000 times, and received over 20 patents. His recent work in AI covers topics such as sports analytics, deep learning, and interpretability.

Rajiv holds a PhD in Communications and a Juris Doctor from the University of Illinois at Urbana-Champaign. While earning his degrees, he received a fellowship in Digital Government from the John F. Kennedy School of Government at Harvard University. He is well known on social media for his short videos, @rajistics, with over 100k followers.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Many coding agents fail in production not because of model limitations, but because the surrounding harness is poorly engineered. The harness, which governs how agents retrieve context, manage state, and interact with tools, often introduces failure modes such as irrelevant context, inconsistent memory, and brittle execution.
This session addresses the practical gap between model capability and system reliability. Practitioners frequently encounter agents that perform well in demos but degrade in real workflows due to issues like context overload, uncontrolled state growth, and unstructured tool use. These problems lead to higher costs, hallucinations, and unpredictable behavior.
We focus on the core technical challenge of harness design: how to structure retrieval, memory, and execution loops so that agents remain grounded, efficient, and reliable over longer tasks. By making these design choices explicit, this session helps practitioners move from prototype agents to production-grade systems that behave consistently under real-world constraints.
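One concrete harness concern, bounding context growth across turns, can be sketched as a simple character-budget trim. This is an illustrative baseline under assumed constraints, not the session's recommended design:

```python
# Keep only the most recent messages that fit within a fixed context budget,
# so agent state cannot grow without bound across long tasks.
def trim_history(history: list[str], budget_chars: int) -> list[str]:
    kept, total = [], 0
    for msg in reversed(history):  # walk from newest to oldest
        if total + len(msg) > budget_chars:
            break
        kept.append(msg)
        total += len(msg)
    return list(reversed(kept))   # restore chronological order

history = ["a" * 50, "b" * 30, "c" * 30]
print([len(m) for m in trim_history(history, budget_chars=70)])
# → [30, 30]
```

Production harnesses typically budget in tokens rather than characters and summarize dropped messages instead of discarding them, but the failure mode being prevented, uncontrolled state growth, is the same.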
WHAT YOU’LL LEARN:
This session is a technical deep dive into harness engineering rather than a research talk.
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research, and is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. As a leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million-dollar sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a consequence of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (≤ 0.05 s penalty) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
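The reflect-then-recall loop at the heart of such an architecture can be sketched as follows. A production system would back this with a vector store and embedding similarity, so the keyword-overlap retrieval here is only a stand-in:

```python
# After a failed step, a "critic" distills a reusable lesson; before the
# next attempt, the agent retrieves lessons indexed under similar tasks.
class EpisodicMemory:
    def __init__(self):
        self.heuristics: list[tuple[set[str], str]] = []

    def reflect(self, task: str, failure: str, lesson: str) -> None:
        # Critic step: index the lesson under the task's keywords.
        self.heuristics.append((set(task.lower().split()), lesson))

    def recall(self, task: str) -> list[str]:
        words = set(task.lower().split())
        # Return lessons whose indexed keywords overlap the new task.
        return [lesson for keys, lesson in self.heuristics if keys & words]

mem = EpisodicMemory()
mem.reflect("restart payment api", "HTTP 429",
            "back off 30s before retrying the payment API")
print(mem.recall("payment api timeout"))
# → ['back off 30s before retrying the payment API']
```

The key property is that the lesson survives the end of the episode, so the next run does not repeat the same API failure from a fresh context window.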
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “Top 10,” “Best 10,” and “Highest 10” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “91% accuracy”; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
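The loop the abstract describes, curating eval cases from observed failures and gating ships on a pass rate, can be sketched in a few lines of Python. Everything here (the `run_workflow` stand-in, the cases, the threshold) is an illustrative assumption, not the speaker's framework:

```python
# Minimal sketch of an eval feedback loop for an LLM workflow.
# All names (run_workflow, eval_cases, PASS_THRESHOLD) are illustrative
# assumptions, not an actual framework from the talk.

def run_workflow(prompt: str) -> str:
    """Stand-in for the LLM system under test."""
    return prompt.strip().lower()

eval_cases = [
    # (input, checker) pairs; checkers encode failure modes seen in production
    ("  Summarize: Q3 revenue rose 10%  ", lambda out: "revenue" in out),
    ("  List top risks  ", lambda out: len(out) > 0),
]

PASS_THRESHOLD = 0.9  # gate for "ready to ship"

def run_evals():
    results = [check(run_workflow(inp)) for inp, check in eval_cases]
    pass_rate = sum(results) / len(results)
    failures = [inp for (inp, _), ok in zip(eval_cases, results) if not ok]
    # Failed inputs feed back into the eval set, keeping it relevant as the
    # system and its failure modes drift over time.
    return pass_rate, failures

rate, failed = run_evals()
print(f"pass rate: {rate:.0%}, regressions: {failed}")
```

The key design point is that the eval set is not static: new failures become new cases, which is what keeps the harness useful from day 2 to day 10.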
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Beyond his applied work, Anthony has contributed to over fifteen research papers at top conferences and journals while at Layer 6, focusing on generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond, but its inherent heterogeneity has hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models continue to improve rapidly. Real data has been shown to be a legitimate option for pre-training, despite previously being underutilized in favour of synthetic pre-training data. We also see that tabular foundation models are starting to exhibit scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she is the technical lead for AI and Information Retrieval applications in the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans the cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “vibes.” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking: seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
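To make the caching point concrete: most serving stacks cache by exact token-prefix match, so requests benefit when static content (system prompt, few-shot examples) is byte-identical and comes first, with per-request content last. The helper below is a sketch under that assumption, not any particular provider's API:

```python
# Sketch: structuring a chat request so the static prefix (system prompt,
# few-shot examples) is identical across requests, letting prefix-based
# prompt caches hit. All names here are illustrative assumptions.

STATIC_SYSTEM = "You are a support assistant. Follow the policy below.\n<policy text>"
FEW_SHOT = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]

def build_messages(user_query: str, session_context: str) -> list[dict]:
    # Cache-friendly ordering: static content first, per-request content last.
    # Putting session_context before FEW_SHOT would change the prefix on
    # every request and defeat the cache.
    return (
        [{"role": "system", "content": STATIC_SYSTEM}]
        + FEW_SHOT
        + [{"role": "user", "content": f"{session_context}\n\n{user_query}"}]
    )

msgs = build_messages("Where is my order?", "customer_id=1234")
```

The same ordering discipline also helps speculative decoding and batching, since stable prefixes make requests easier for the server to group.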
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m an AI Strategist at Teradata supporting financial customers in the US and Canada. I am a data scientist and technologist who works on creating solutions that help drive business outcomes for our customers. Before Teradata, I worked for various startups supporting customers in forward-engineering roles. I have also held co-founding roles at several companies and hold several patents across various domains. I live in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Financial institutions collect events from various touch points such as call transcripts, chatbots, branch visits, and transactions, yet often struggle to describe the ‘customer task’ or the ‘paths leading to interesting outcomes’ behind them. Knowing this allows businesses to understand the intent behind user behavior, provide better services and offers, and improve customer retention. While these discrete event sequences are not exactly NLP, they do have a vocabulary of their own, along with timestamps. This talk describes how to build a range of white-box and deep learning transformer/generative models and addresses the trade-offs across accuracy, explainability, and inference complexity, so businesses can pick the model that fits each regulatory or non-regulatory use case and still achieve the same objective.
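As a hedged illustration of the “vocabulary of events” idea, the sketch below tokenizes one customer's touch-point events into an integer sequence, the usual first step for both white-box and transformer sequence models; the event names and encoding are invented for illustration:

```python
# Sketch: turning discrete customer events into integer token sequences,
# a common first step for both white-box (e.g. Markov) and transformer
# sequence models. Event names are invented for illustration.

from datetime import datetime

events = [  # one customer's journey as (timestamp, event) pairs
    (datetime(2024, 1, 2), "CALL_BILLING"),
    (datetime(2024, 1, 3), "CHATBOT_ESCALATION"),
    (datetime(2024, 1, 9), "BRANCH_VISIT"),
]

# Build the event vocabulary, reserving 0 for padding.
vocab = {"<PAD>": 0}
for _, ev in events:
    vocab.setdefault(ev, len(vocab))

tokens = [vocab[ev] for _, ev in events]
# Inter-event gaps (in days) can be fed as a parallel feature stream,
# preserving the timestamp information the token IDs discard.
gaps = [(b[0] - a[0]).days for a, b in zip(events, events[1:])]

print(tokens, gaps)
```

From here the same token sequences can feed either an interpretable model for regulated use cases or a transformer for higher accuracy, which is the trade-off the talk explores.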
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models, with a background in Computational & Applied Math and 4+ years of experience in the financial services sector. Javeria has led projects and models at the intersection of risk modelling and the automotive industry, and is particularly passionate about auto shopping behavior, dealer gaming, and fraud, and their impact on the viability of risk models.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
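For readers unfamiliar with the mechanics, here is a minimal sketch of within-subgroup permutation importance scored as FVU. The quantile-bin subgrouping is a crude stand-in for the Model-X techniques the talk covers, not the method itself:

```python
# Sketch: permutation importance scored as Fraction of Variance
# Unexplained (FVU), permuting a feature only WITHIN subgroups defined by
# a correlated feature so that shuffled rows stay close to the joint
# distribution. A crude stand-in for the talk's Model-X approach.

import numpy as np

rng = np.random.default_rng(0)
n = 2000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # x2 strongly dependent on x1
y = x1 + x2
X = np.column_stack([x1, x2])

def model(X):                        # stand-in "fitted" model: the true function
    return X[:, 0] + X[:, 1]

def fvu(y_true, y_pred):
    # FVU in [0, 1]: 0 = nothing lost by permuting, 1 = all signal lost.
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def conditional_importance(X, y, col, group_col, n_bins=20):
    Xp = X.copy()
    edges = np.quantile(X[:, group_col], np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(X[:, group_col], edges)
    for b in np.unique(bins):        # shuffle only within each subgroup
        idx = np.where(bins == b)[0]
        Xp[idx, col] = rng.permutation(Xp[idx, col])
    return fvu(y, model(Xp))

# An unconditional shuffle would force extrapolation; the conditional
# shuffle stays in-distribution and yields a more conservative score.
imp = conditional_importance(X, y, col=1, group_col=0)
print(f"conditional FVU importance of x2: {imp:.3f}")
```

Because x2 is nearly determined by x1, within-bin shuffling barely perturbs the predictions, so the conditional score stays small, whereas a naive global shuffle would report x2 as highly important.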
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
TECHNICAL LEVEL:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare), this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
Humans and Continual Learning AI Agents: The Journey
Optimizing Vector Search: Why You Should Flatten Structured Data. An Analysis of How Flattening Structured Data Can Boost Precision and Recall by Up to 20%
Jailbreaking the Blockchain: How I Used Game Theory to Map Prompt Injection Attack Surfaces in Agentic Systems
INSPIRE: Intent-aware Neural Sponsored Product Retrieval for E-commerce
The Vicious Loop: Why Stateless Agents Fail in Production and How We Built Episodic Memory to Fix It
AI Agents from Experiment to Institutional Capabilities
Deploying with Purpose: Embedding Economic Evaluation Across the AI Lifecycle
The Meaning Gap: Why Your Agent Is Right and Your Deployment Can Be Wrong. From $62M Write-Offs to Life-Saving AI Systems in Production — The Human Operating Model that Makes AI Stick
From Day 2 to Day 10: Operationalizing Evals for Real-World LLM Systems
Leading Trustworthy AI Engineering in Legal: Alignment, Trade-offs, and the Glue That Holds It Together
Reasoning Robots: Open World Navigation and Memory for Agentic Robots
Pre-RFP Pension Fund Prospect Ranking: Proxy Targets on Noisy Mandate Data, LLM-Assisted Research, and Human-in-the-Loop Coverage
Squeezing More Juice Out of Your LLM API: Performance Optimizations and How to Leverage Them
RESEARCH ENGINEER, SMOLLM LEAD, HUGGING FACE
SmolLM: The Rise of Smol Models
Browse the full summit agenda, including virtual sessions, in-person talks, keynotes, and workshops. Use the embedded schedule below to explore sessions, speakers, and timing across the event.
TMLS is Canada’s flagship summit for applied ML, AI infrastructure, and enterprise adoption. We bring together the researchers, practitioners, and leaders putting AI into practice across Canada. If you have real lessons, practical wins, or important research to share, we’d love to hear from you.
We’re looking for talks grounded in real work, from production systems and implementation challenges to research that helps the community understand what matters now and what comes next.
Business Leaders: C-Level Executives, Project Managers, and Product Owners will get to explore best practices, methodologies, and principles for achieving ROI.
Engineers, Researchers, Data Practitioners: Will get a better understanding of the challenges, solutions, and ideas being offered via breakouts & workshops on Natural Language Processing, Neural Nets, Reinforcement Learning, Generative Adversarial Networks (GANs), Evolution Strategies, AutoML, and more.
Job Seekers: Will have the opportunity to network virtually and meet 30+ top AI companies.
What is an Ignite Talk?
Ignite is an innovative and fast-paced style used to deliver a concise presentation.
During an Ignite Talk, presenters discuss their research using 20 image-centric slides which automatically advance every 15 seconds.
The result is a fun and engaging five-minute presentation.
You can see all our speakers and full agenda here