Researchers, practitioners, and industry leaders sharing practical lessons from real AI and ML work.
One virtual day, two in-person days of keynotes and technical sessions, and one dedicated workshop day.
TMLS brings together voices from industry and research to share real-world lessons in machine learning, AI infrastructure, enterprise adoption, and applied AI. From keynote sessions to technical talks, the program is built for people looking to learn from work that is grounded in practice.
Jump to:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dawn Song is a Professor in Computer Science at UC Berkeley and Co-Director of Berkeley Center for Responsible Decentralized Intelligence. Her research interest lies in AI safety and security, Agentic AI, deep learning, security and privacy, and decentralization technology. She is the recipient of numerous awards including the MacArthur Fellowship, the Guggenheim Fellowship, the NSF CAREER Award, the Alfred P. Sloan Research Fellowship, the MIT Technology Review TR-35 Award, ACM SIGSAC Outstanding Innovation Award, and more than 10 Test-of-Time Awards and Best Paper Awards from top conferences in Computer Security and Deep Learning. She has been recognized as Most Influential Scholar (AMiner Award), for being the most cited scholar in computer security. She is an ACM Fellow and an IEEE Fellow, and an Elected Member of American Academy of Arts and Sciences. She obtained her Ph.D. degree from UC Berkeley. She is also a serial entrepreneur and has been named on the Female Founder 100 List by Inc. and Wired25 List of Innovators.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Ion Stoica is a Professor in the EECS Department and holds the Xu Bao Chancellor Chair at the University of California, Berkeley. He is the Director of the Sky Computing Lab and the Executive Chairman of Databricks and Anyscale. His current research focuses on AI systems and cloud computing, and his work includes numerous open-source projects such as vLLM, SGLang, Chatbot Arena, SkyPilot, Ray, and Apache Spark. He is a Member of the National Academy of Engineering, an Honorary Member of the Romanian Academy, and an ACM Fellow. He has also co-founded several companies, including LMArena (2025), Anyscale (2019), Databricks (2013), and Conviva (2006).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
From 2018 to 2026, Manuela Veloso was the founder and Head of JPMorganChase AI Research & Herbert A. Simon University Professor Emerita at Carnegie Mellon University, where she was faculty in the Computer Science Department and then Head of the Machine Learning Department.
Veloso has a licenciatura degree in Electrical Engineering and an M.Sc. in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, an M.A. in Computer Science from Boston University, and a Ph.D. in Computer Science from Carnegie Mellon University. Veloso has Doctorate Honoris Causa degrees from the Örebro University, Sweden, the Instituto Universitário de Lisboa (ISCTE), Portugal, the Université de Bordeaux, France, and the Universidade Católica of Portugal.
She served as president of the Association for the Advancement of Artificial Intelligence (AAAI), and she is co-founder and a Past President of the RoboCup Federation. She is a fellow of main professional organizations in her area, namely AAAI, IEEE, AAAS, and ACM. She is the recipient of the ACM/SIGART Autonomous Agents Research Award, the Einstein Chair of the Chinese Academy of Sciences, an NSF Career Award, and the Allen Newell Medal for Excellence in Research. Veloso is a member of the National Academy of Engineering with a citation “for contributions to artificial intelligence and its applications in robotics and the financial services industry.” She is also a member of the Academy of Sciences of Portugal.
Her research interests are in AI, including Autonomous Robots, Multiagent Systems, Continual Learning Agents, and AI in Finance. For further details, see www.cs.cmu.edu/~mmv.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
I will talk about AI agents, and multiagent systems, in particular. I will focus on the agent’s perception as the robust processing and sharing of information, the agent’s cognition as their planning and memory-based reasoning abilities, and the agent’s action as the capabilities to execute in their environment. While AI has the potential to assist humans with many tasks, the future aims at a seamless integration of humans and AI with AI agents able to collaborate and continuously learn. The talk will include examples of robot and digital agents.
WHAT YOU’LL LEARN:
AI Agents have limitations, they rely on other agents and humans to improve their performance over time.
ABOUT THE SPEAKER:
Freddy Lecue is a Managing Director and Head of Frontier AI Model Methodology at Wells Fargo, where he architects and scales Generative AI, agentic AI, and advanced machine learning models for enterprise production, while balancing performance, latency, cost, and risk.
He leads the firm’s AI research agenda, elevates modeling standards through targeted training, and establishes best-practice frameworks to enhance robustness, scalability, and model validation. Freddy also drives AI-enabled transformation across the end-to-end model lifecycle, including development, documentation, testing, and validation.
Prior to Wells Fargo, he held senior AI leadership roles at JPMorgan Chase, Thales Canada, Accenture Ireland, and IBM Ireland. He holds a Ph.D. in Computer Science and is based in New York City.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
TBA
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Armando Benitez is the Chief Data & Analytics Officer (CDAO) and Head of AI at BMO Capital Markets. He leads a team of engineers, strategists, and AI professionals who create end-to-end solutions at the intersection of Finance and Technology.
As CDAO, Armando shapes the strategic vision for data and analytics, integrating AI into business processes to drive innovation and improve decision-making. His leadership promotes data-driven insights and aligns technological initiatives with business goals.
Armando joined BMO’s ETF desk in 2016 after working on data products for fraud detection and recommender systems at Paytm. With a background in High Energy Physics, he brings a unique perspective to the team.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
AI agents are moving rapidly from experimental prototypes to production systems embedded in critical business workflows. In regulated environments such as capital markets, deploying agents requires more than model performance. It requires governance, reliability, human oversight, and a clear path to measurable value.
WHAT YOU’LL LEARN:
We will discuss architectural patterns, governance frameworks, and operational lessons learned from deploying agents that interact with real data, real clients, and real risk.
ABOUT THE SPEAKER:
Dr. Mamdani is Clinical Lead – Artificial Intelligence at Ontario Health and Director of the University of Toronto Temerty Faculty of Medicine Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM). Previously, Dr. Mamdani was Vice President of Data Science and Advanced Analytics at Unity Health Toronto where his team deployed over 50 AI solutions to improve patient outcomes and hospital efficiency. Dr. Mamdani is also Professor in the Department of Medicine of the Temerty Faculty of Medicine, the Leslie Dan Faculty of Pharmacy, and the Institute of Health Policy, Management and Evaluation of the Dalla Lana School of Public Health. He is also an Affiliate Scientist at IC/ES and a Faculty Affiliate of the Vector Institute. In 2024, Dr. Mamdani’s team received the national Solventum Health Care Innovation Team Award by the Canadian College of Health Leaders. Also in 2024, Dr. Mamdani was named international AI Leader of the Year by AIMed. Previously, Dr. Mamdani was named among Canada’s Top 40 under 40. He has published over 600 studies in peer-reviewed medical journals. Dr. Mamdani obtained a Doctor of Pharmacy degree (PharmD) from the University of Michigan (Ann Arbor) and completed a fellowship in pharmacoeconomics and outcomes research at the Detroit Medical Center. During his fellowship, Dr. Mamdani obtained a Master of Arts degree in Economics from Wayne State University with a concentration in econometric theory. He then completed a Master of Public Health degree from Harvard University with a concentration in quantitative methods.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Artificial intelligence has the potential to transform healthcare yet its adoption has been slow. This presentation will review the potential for AI in healthcare using real world examples and discuss the challenges in its adoption.
WHAT YOU’LL LEARN:
TBA
ABOUT THE SPEAKER:
Prof. Steven Waslander is a leading authority on autonomous robotics, including self-driving cars and multirotor drones. He received his B.Sc.E.in 1998 from Queen’s University, his M.S. in 2002 and his Ph.D. in 2007, both from Stanford University in Aeronautics and Astronautics. He was recruited to the University of Waterloo from Stanford in 2008, where he led the Autonomoose project, the first self-driving car to be tested on public roads by a Canadian university. In 2018, he joined the University of Toronto Institute for Aerospace Studies (UTIAS), and founded the Toronto Robotics and Artificial Intelligence Laboratory (TRAILab).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic reasoning for robots is rapidly becoming a reality, allowing flexible natural language interaction with human operators and enabling a wide range of navigation, object handling and recall tasks in a variety of settings. In this talk, Prof. Waslander will discuss the ongoing efforts in his lab to make useful agentic robots for the warehouse and outdoor settings, by integrating open world perception with agentic reasoning for reliable open world navigation, and by adding multi-faceted memory – spatial, descriptive and visual – to enable experience recall for temporal question answering. Together, these advances allow a wide variety of spatial, semantic, functional and temporal tasks to be completed by robots without any fine-tuning to specific domains.
WHAT YOU’LL LEARN:
Scaffolding around agent needed to make spatial intelligence possible, big gap between primary LLM /MLLM uses and robotics, lots to explore.
ABOUT THE SPEAKER:
I am a Senior Software Engineer with 12 years of experience, specializing in AI Infrastructure and Semantic Search. I am currently building a semantic search platform, managing high-throughput embedding pipelines, and orchestrating vector databases on Kubernetes. As an author on Towards Data Science, I share practical, empirical strategies for vector search optimization.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
When integrating structured data into a RAG system, engineers often default to embedding raw JSON into a vector database. The reality, however, is that this intuitive approach leads to dramatically poor retrieval performance. Modern embeddings leverage BERT architectures optimized for natural language, which struggle with the high frequency of non-alphanumeric characters found in JSON syntax.
In this session, I will break down the exact failures of embedding structured data—from tokenization and attention mechanism disruption to the mathematical liability of Mean Pooling on syntax tokens. I will then demonstrate a practical, production-ready solution: implementing a simple preprocessing step to convert structured JSON into natural language templates. Backed by empirical testing on the Amazon ESCI dataset, I will show how this straightforward architectural shift natively boosts Recall@10 by over 19% and MRR by 27%.
Note: This article was recently featured as a top weekly article on Towards Data Science, reaching over 9,000 views.
WHAT YOU’LL LEARN:
The analysis confirms that embedding raw structured data into generic vector space is a suboptimal approach and adding a simple preprocessing step of flattening structured data consistently delivers significant improvement for retrieval metrics (boosting recall@k and precision@k by about 20%). The main takeaway for engineers building RAG systems is that effective data preparation is extremely important for achieving peak performance of the semantic retrieval/RAG system.
ABOUT THE SPEAKER:
I’m a Senior Security Engineer at Ripple and an Executive MBA graduate of Cornell SC Johnson College of Business, with over 10 years of experience securing large-scale cloud and blockchain infrastructure at organizations including Google, Nike, eBay, and Cisco.
My research sits at the intersection of machine learning and adversarial security — spanning game-theoretic prompt injection frameworks, zero-knowledge ML for blind signing prevention, federated threat detection, and RAG-based cybersecurity systems. I’ve published work across IEEE venues on LLM security, neuromorphic computing, and decentralized trust models.
At Ripple, I lead security engineering across multi-cloud environments (AWS, Azure, GCP), applying ML-driven automation to threat detection, incident response, and compliance at blockchain payments scale. My practitioner perspective bridges the gap between theoretical ML security research and the operational realities of deploying AI in high-stakes financial infrastructure.
I hold AWS Certified Security Specialty and CISM certifications, serve as an Associate Fund Manager with Big Red Ventures at Cornell, and am an active TPC reviewer for IEEE conferences. I speak and write on the security implications of decentralization — including the tension between trustless systems and centralized control vectors that make blockchain environments uniquely vulnerable.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most AI agent evaluation frameworks are built to measure capability, not adversarial robustness. The result: agents that ace benchmarks but collapse under real-world attack conditions — prompt injection, goal hijacking, tool misuse, and output trust exploitation that standard evals never surface.
Drawing on research developing game-theoretic prompt injection frameworks and zero-knowledge ML systems for high-stakes financial infrastructure at Ripple, this session presents a concrete methodology for adversarial agent evaluation that practitioners can apply to their own pipelines.
You’ll leave with a working model for mapping your agent’s attack surface across orchestration logic, tool boundaries, and downstream trust chains — and a set of design patterns for building agents that are auditable and adversarially robust by architecture, not by accident. Whether you’re building coding agents, deploying LLM workflows in production, or responsible for governing AI systems at scale, this session reframes how you think about evaluation before your next deployment.
WHAT YOU’LL LEARN:
Agent vulnerability is primarily architectural, not a model alignment problem — fixing the model without addressing orchestration logic leaves the most exploitable attack surface untouched
ABOUT THE SPEAKER:
I am currently a Machine Learning Scientist in the Sponsored Products Search team at Walmart that is responsible for powering the advertising technlogy for Walmart’s e-commerce platform. My work spans the domain of semantic query and item understanding, retrieval (traditional IR and neural networks), ranking, and ad auction and monetization. Apart from product dev, I work on applied research. Recently, I got a paper accepted at SIGIR 2026, Industry track: https://arxiv.org/pdf/2604.07930
Prior to that, I was a computer vision scientist at Walmart’s Intelligent Retail Lab (IRL). My work in the retail space focused on scene understanding and fine-grained image retrieval, where I developed solutions that significantly reduce shrinkage at self-checkout systems (SCO), leading to millions of dollars in recovered revenue. These systems utilize multiple sensor inputs, including computer vision cameras, weight sensors, RFID, hand scanners, and barcode readers, to streamline and enhance the checkout process. I was recognized with the prestigious Impact Award for driving extraordinary contributions by proactively identifying critical improvement areas, implementing innovative solutions, and delivering exceptional business results.
Prior to joining Walmart, I worked at Orbital Insight where I worked on development of multi-class object detectors to identify ships, aircraft, and armored vehicles from satellite imagery, supporting strategic intelligence initiatives. My experience also extends to environmental monitoring, where I worked on land use change detection algorithms. My research in geospatial computer vision includes authoring a paper on “Active Learning for Improved Semi-Supervised Semantic Segmentation in Satellite Images,” accepted at WACV 2022.
Earlier, I pursued my master’s research at UMass Amherst, working with Professor Madalina Fiterau. During my time at UMass Amherst, I co-authored a paper titled “Pedestrian Detection in Thermal Images Using Saliency Maps,” published in the CVPR 2019 workshop.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Walmart holds the largest share of the U.S. e-commerce grocery market, where food and beverage categories generate some of the highest search traffic and, consequently, drive a substantial portion of sponsored search revenue. At this scale, even small mismatches between user intent and retrieved products can lead to significant losses in both user engagement and monetization. Yet, understanding user intent in grocery search is inherently challenging. Queries are typically short, ambiguous, and highly diverse, often underspecifying critical preferences.
For example, a query like schar white bread implicitly encodes a gluten-free preference through brand association, while queries such as chickpea pasta or oatmilk reflect underlying dietary preferences like gluten-free, plant based, or lactose-free alternatives. Failing to capture these signals results in retrieving products that might be semantically similar but misaligned with the user’s true needs.
From the advertiser’s perspective, many products are explicitly designed to target specific intents—such as dietary preferences or size variants—and must be surfaced at the right moment to be effective. For example, a brand like Quest Nutrition, which sells high-protein, low-sugar snacks, wants its products to appear for queries like protein bars, low carb snacks, or keto snacks, even when these attributes might not be explicitly stated in the product title text. When retrieval systems fail to capture these intent signals, relevant products are not shown to the right users at the right time. From an advertiser’s perspective, this means their products are missing high-intent opportunities where conversion is most likely. Over time, this leads to lower returns on ad spend, reduced trust in the platform, and potential advertiser attrition. Losing advertisers directly translates to a loss in advertising revenue and weakens the overall sponsored search ecosystem. This challenge is further amplified in sponsored search, where only a limited number of ad slots are available, making precise relevance essential. Thus, we propose INSPIRE (Intent-aware Neural Sponsored Product Retrieval for E-commerce), an intent aware retrieval framework for sponsored search that leverages structured intent signals to better align user queries with relevant food and beverage products. INSPIRE represents intent as a set of structured, multi-dimensional attributes derived from both user queries and product content, capturing explicit signals (e.g., brand, flavor) as well as implicit preferences (e.g., dietary constraints, cuisine types) that are often not directly expressed in queries.
We develop a weakly supervised intent learning pipeline, where a large language model serves as a teacher to generate structured intent annotations from product titles and descriptions. We then distill these annotations by using them to finetune a lightweight student LLM model through LoRA based supervised finetuning (LoRA-SFT) that predicts intent attributes—such as brand, flavor, dietary preference, ingredient, product subtype, and cuisine type—at Walmart catalog scale. We then introduce an intent-augmented dense retrieval framework, where predicted intents are incorporated into query and product representations within a bi-encoder, enabling more precise matching between queries and sponsored products. To support real-world usage, we deploy the system as a scalable inference service. The distilled student model is served via a high-throughput API powered by vLLM, enabling efficient intent prediction over large product catalogs with low latency. This design ensures that
intent-aware retrieval can be applied in production settings while maintaining efficiency and scalability.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mengying is currently the Head of Data & Product Growth at Braintrust, an AI observability and evaluation platform, where she leads all data initiatives and self-service business strategy. She’s also an a16z scout and an active angel investor in data tools, developer tools, and B2B SaaS. Previously, she led growth and data at MotherDuck and Notion.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session walks through a practical, end-to-end framework for evaluating AI applications in production. We start with foundations: what evals are, when to start, and why they’re a team sport across engineering, product, and domain experts. From there, we cover the common eval process — gathering improvement signals from production logs, user feedback, and human review; defining success criteria with primary metrics, tracking metrics, and guardrails; and building scorers (code-based, LLM-as-a-judge, and human review) matched to the right use case. The second half focuses on agentic AI evaluation: how to assess task success, tool accuracy, and cost for single agents, then layer on orchestration, routing, and conversation quality metrics for multi-agent and multi-turn systems. We close with remote evals — testing real agents against real dependencies without mocking — and the mindset shift that eval is not a gate but a continuous improvement loop.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Sriram Selvam is a Senior Software Engineer at Microsoft AI with over 14 years of industry experience, specializing in generative search and the deployment of Large Language Model (LLM) applications across distributed systems. He was a core founding team member behind Bing’s Generative Search framework and continues to build AI solutions that enhance user experiences at scale.
Alongside his engineering role, Sriram is an independent researcher deeply invested in the ethical challenges of AI, particularly long-term privacy, sensitive data memorization, and responsible model behavior. His recent work includes co-developing PANORAMA, a large-scale synthetic dataset of 384,000 samples from realistic human profiles, built to model the distribution and context of Personally Identifiable Information (PII) in online content. This work enables robust model auditing and provides researchers with the open-source tooling needed to evaluate privacy-preserving mitigation strategies. Sriram holds an M.S. in Computer Science from the University of Utah.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
To address the critical gap in privacy risk assessment, we introduce PANORAMA (Profile-based Assemblage for Naturalistic Online Representation and Attribute Memorization Analysis). PANORAMA is a large-scale, fully synthetic text corpus containing 384,789 samples derived from 9,674 internally consistent synthetic human profiles. Generated using constrained selection and reasoning LLMs, the dataset spans six distinct online modalities, including social media posts, forum discussions, reviews, and marketplace listings. This session will explore how PANORAMA accurately emulates the naturalistic distribution and variety of sensitive data, enabling researchers to systematically study PII memorization, conduct rigorous model auditing, and benchmark privacy-preserving techniques without exposing real user data.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ankit Haseeja is a cloud and AI enthusiast with deep expertise in designing scalable and cost-efficient cloud architectures. He holds all AWS certifications and has earned the prestigious AWS Golden Jacket, demonstrating exceptional proficiency across the AWS ecosystem.
With a strong background in cloud engineering and system design, Ankit specializes in building high-performance, production-grade solutions and optimizing cloud costs at scale. He is passionate about sharing practical insights on cloud, AI, and modern architecture patterns to help others grow in the tech space.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agentic AI systems are rapidly evolving from experimental prototypes to enterprise-critical applications, yet most organizations struggle to scale them reliably in production. The challenge is not just about increasing compute, but managing orchestration complexity, controlling costs, and ensuring resilience across distributed cloud environments.
This session explores how MCP (Multi-Cloud Practices) can be leveraged to address these challenges and enable scalable, production-grade agentic AI systems. We will dive into architectural patterns, intelligent orchestration strategies, and cost optimization techniques that go beyond brute-force scaling.
Attendees will gain practical insights into designing resilient, efficient, and enterprise-ready AI systems, along with real-world learnings on how to scale agentic workflows across cloud environments while maintaining performance and cost efficiency.
WHAT YOU’LL LEARN:
Develop advanced multi-agent coordination models, including state management and fault isolation, for reliable scaling
ABOUT THE SPEAKER:
Dippu Kumar Singh has over 16 years of experience at the intersection of industry innovation and advanced research. He is a recognized authority in building scalable, trustworthy, and commercially viable AI systems. Being a Leader for Emerging Technologies at Fujitsu North America, Dippu specializes in bridging the gap between theoretical AI concepts and enterprise-grade implementation. His strategic leadership has spearheaded multi-million in sales pipelines and delivered remarkable savings through AI-driven optimizations in transportation, manufacturing, utilities, and supply chain logistics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Stateless autonomous agents in production typically stall at a 44% task success rate due to repeated API failures, a pattern of relying on ephemeral context windows instead of persistent learning.
To close this gap, we present Agentic Memory, a reflection-episodic memory architecture that combines vector storage, automated reflection loops, and heuristic extraction to enable continuous agent learning without model fine-tuning. Across diverse simulated enterprise workflows including IT incident response and data pipeline orchestration, Agentic Memory achieves task completion rates ranging from 85% to 95%, with peak performance (93.3%) in complex, multi-step scenarios.
The method outperforms five standard stateless and naive-RAG agent baselines across all evaluation scenarios. Our background “Critic” process extracts and indexes failure heuristics with near-zero latency overhead (latency penalty ≤ 0.05s) while significantly reducing the API error rate compared to baseline approaches. The framework integrates episodic vector stores, actor-critic reflection patterns, and shared experience banks to address the amnesia and reliability gap in autonomous agent operations.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Matt Mazzarell is AI Lead, Financial Services, Americas at Teradata. He helps customers across the US and Canada apply AI with Teradata technologies to solve high-value business problems. His extensive experience with distributed processing platforms enables customers to optimize their AI workflows to run at large scale. He has built AI capabilities that have shaped product direction and driven meaningful shifts in customer strategy. Before joining Teradata, Matt worked in both startup and established engineering environments, where he honed his skills across multiple disciplines.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
All organizations hold valuable information that originates as or can be translated into text. AI models extract semantic meaning and store the results in large-scale vector databases, which LLMs can then leverage to derive actionable insight. The desired goal is to scale with large enterprise datasets without compromising analytical integrity. This session presents a workflow that automatically segments text data, identifies patterns, and generates insight to drive business outcomes. Two case studies will be discussed: Database Query Optimization and Automated Topic Detection — two very different problems solved with similar analytical techniques operating at massive scale.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dhari Gandhi is a PMP-certified AI Project Manager at the Vector Institute, where she leads applied AI initiatives that bridge technical delivery with real-world impact. She is also a co-author of the ResAI (Responsible AI) Guide, an open-source, risk-based framework designed to help decision-makers, from product leads to compliance teams, govern generative AI responsibly through actionable strategies that address real deployment risks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI adoption accelerates across sectors, many organizations are discovering a persistent gap: technically strong models often fail to translate into measurable real-world value. In practice, AI initiatives frequently stall not because the models underperform, but because economic impact, operational fit, and long-term sustainability were never rigorously evaluated.
This executive-focused talk introduces a practical framework for embedding economic evaluation throughout the AI lifecycle, from business problem definition and data acquisition through deployment, monitoring, and scale decisions. Moving beyond traditional model metrics, the session outlines how organizations can systematically assess costs, benefits, risks, and time horizons to determine whether an AI system is truly worth building and sustaining.
Drawing on applied experience supporting AI deployments across healthcare, finance, and public sector contexts, the talk presents:
Designed for executive and senior technical leaders, this session provides a clear, actionable lens to move AI initiatives beyond experimentation toward accountable, economically defensible deployment. Attendees will leave with concrete strategies to forecast value earlier, avoid common failure patterns, and make more confident scale-or-retire decisions for AI investments.
WHAT YOU’LL LEARN:
Practitioners and leaders can immediately apply:
ABOUT THE SPEAKER:
Mario Lazo is a Principal AI Solution Architect at Insight Global Consulting, where he leads large-scale AI and automation programs across healthcare and financial services. His work focuses on translating AI strategy into production systems that deliver tangible operational and human impact.
Mario has implemented high-volume document automation processing over 6,000 invoices per day for a major health system and earned an Innovation Award for deploying a fax referral automation solution that reduced patient intake delays—improving care coordination and saving lives.
He previously served as graduate faculty teaching the evolution from “vibe coding” to applied agent engineering, and authored AI Data Privacy & Protection. He sits on the TMLS 2026 and MLOps World Steering Committees and is completing the MIT Professional Education program in Applied Agentic AI for Organizational Transformation.
His practitioner ethos is built around a single principle: closing the gap between AI pilots and production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every agent deployment has a postmortem — or it should. Across 120+ production workflows in healthcare, financial services, and global telecommunications, the pattern behind both the failures and the survivors is consistent: the most expensive problems weren’t technical, they were organizational.
This talk examines the hidden variable that determines whether production AI systems live or die in regulated environments: the organizational readiness to close the meaning gap between what an agent outputs and what a human actually needs to act responsibly — and to build enough trust to keep that system online after the first anomaly.
LLMs optimize for statistical similarity; humans require meaning. “”””Top 10,”””” “”””Best 10,”””” and “”””Highest 10″””” can cluster identically in vector space, yet imply completely different decisions for the person on the other end. Multiply that LLM Similarity Trap across every handoff between agent output and human judgment, and you get the most expensive unsolved problem in enterprise AI — one no model update will fix.
This is not a framework talk. It is a what-actually-happened talk grounded in real incidents: an executive who shut down eleven weeks of production gains after a single anomaly because no one gave her a vocabulary for “”””91% accuracy””””; a frontline team that quietly routed around a bot for six months because it never fit how they actually worked; and a clinical team that became the loudest internal champions for an AI system because they were involved before a single line of production code was written.
From these cases and my post-go-live experiences implementing GenAI workflows and agents, the talk distills a six-part operational playbook: the 4 Modes of Human-Agent Collaboration, the Agent Seniority Ladder, the Dignity Clause, the 3-Tier AI Literacy Model, the Aikido Framework for organizational resistance, and the Protocol of Interaction. Each emerged from a production failure — not a lab or a slide deck. The playbook is grounded in a four-layer architecture that connects technical human-in-the-loop design to organizational accountability — because confidence thresholds and escalation routing without named human ownership above them are just infrastructure with no one responsible for what comes out.
The audience leaves with one question: in your current agent deployment, who owns the meaning gap — and how, specifically, are you closing it?
WHAT YOU’LL LEARN:
Name the Meaning Gap owner before go-live. One person accountable for the output-to-decision handoff. If you can’t name them, your HITL escalations have nowhere to land.
Design your human review queue before your confidence thresholds. Zone 2 and Zone 3 only work if reviewers know what adequate review requires.
Require a reasoning chain, not just an output. Reviewers who see only the output rubber-stamp. Reviewers who see the reasoning evaluate.
Log the downstream outcome of every override. Without it, your override log is audit infrastructure. With it, it’s your primary recalibration signal.
Instrument for human behavior, not model performance. Override rate, abandonment rate, and review velocity catch what accuracy dashboards miss.
Translate accuracy into a decision vocabulary before go-live. 91% accuracy means nothing to an executive without a protocol for the 9%.
Assess the autonomy mismatch before you commit to architecture. Advancing through confidence zones is technical. Advancing through the Agent Seniority Ladder is organizational. Both have to move together.
ABOUT THE SPEAKER:
Korede Adegboye is a Machine Learning Engineer at Priceline focused on AI quality, reliability, and evaluation systems. Their work centers on helping teams move fast with confidence by building evaluation frameworks, feedback loops, and safety nets for ML and LLM systems. They focus on making quality measurable in practice, detecting when performance regresses, and giving teams clearer signals for when systems are ready to ship or need attention.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most teams can get an LLM workflow to look good in a demo. Far fewer have a reliable answer for what happens after that.
This talk focuses on the day 2 to day 10 problems of real-world LLM systems: how to keep evals useful as prompts, workflows, and failure modes change over time. I’ll share a practical framework for operationalizing evals through automated dataset curation, failure-mode detection, and uncertainty-aware decisioning. I’ll also cover how batching and flexible compute can make evaluation fast enough to support real developer workflows instead of becoming a bottleneck.
The session bridges familiar ML evaluation discipline with the realities of LLM systems. I’ll show how to move beyond static benchmarks, how to use failure signals to keep eval sets relevant, and how to design eval feedback loops that help teams move faster with more confidence. The goal is to give practitioners a concrete path from ad hoc prompt checks to a durable evaluation system for production LLMs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anthony is a Senior Research Machine Learning Scientist, leading a team responsible for delivering predictive models in the finance & trading space. Prior to joining Layer 6, Anthony completed a PhD in the Department of Statistics at the University of Oxford, with a focus on statistical machine learning and generative modelling. He also completed a BMath and MMath at the University of Waterloo, with several internships focused on finance and research. Besides the applied side, Anthony has also helped deliver over fifteen research papers to top conferences and journals whilst at Layer 6, focusing on the areas of generative modelling, tabular data analysis, and anomaly detection.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Tabular data is ubiquitous worldwide, driving solutions for generic business problems, applied time series forecasting, and beyond. This inherent heterogeneity had hindered Tabular Foundation Models (TFMs) from rapidly generalizing to unseen datasets. In-Context Learning (ICL) offers a promising path for TFMs, enabling dynamic task adaptation without fine-tuning. Moving beyond re-purposed language models, we propose combining ICL-based retrieval with self-supervised learning to train dedicated TFMs. We evaluate real versus synthetic pre-training data, demonstrating that real data captures complex signals critical for improving downstream generalization. Incorporating this real data yields significantly faster training and superior adaptability across diverse contexts. Our resulting model, TabDPT, achieves strong performance across varied classification and regression benchmarks. Importantly, our pre-training procedure demonstrates that scaling model and data size drives consistent, power-law performance improvements. This echoes foundational scaling laws, confirming that robust, large-scale, and equitable TFMs are highly achievable. We have open-sourced our complete training and inference pipeline.
WHAT YOU’LL LEARN:
Tabular foundation models are continuing to vastly improve. Real data has been shown to be a legitimate option for pre-training despite previously being underutilized in favour of synthetic pre-training data. We see as well that tabular foundation models are starting to demonstrate scaling laws much like LLMs.
ABOUT THE SPEAKER:
Zahra Shekarchi is a Lead Research Engineer at Thomson Reuters, where she tech leads AI and Information Retrieval applications for the legal domain. She brings 9 years of experience across Search, Recommendations, MLOps, and Generative AI. Her work spans cognitive science, health, media, and legal industries. She’s passionate about building better practices so teams can focus on the work that matters most.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In the legal domain, moving from a successful prototype to a production-grade system is rarely a straight line. Scaling trustworthy AI and Information Retrieval solutions requires a transition from experimental algorithms, RAG, and agentic prototypes to production-grade systems with a robust delivery strategy. The challenge extends beyond technical implementation to the intersection of research uncertainty, engineering and operational constraints, team dynamics, and product alignment.
This session outlines a framework for a Technical Lead, guiding teams through this complexity, drawn from the experience of delivering high-stakes regulated AI solutions.
We will show how establishing clear metrics and progressive target values defines what ‘Good’ means to stakeholders. These targets become the shared objectives between technical teams and Product stakeholders that build trust through an iterative discovery and delivery process. Each sprint then demonstrates measurable progress beyond experimental “”””vibes.”””” These shared metrics also inform capacity-effort planning, enabling honest reprioritization when resources are constrained. This method prevents falling into the ‘Hero Trap’ of unsustainable delivery, a pressure familiar to Tech Leads.
We will reframe technical debt not as a drawback, but as a calculated liability—treating it as a high-ROI instrument for speed-to-market and as a strategic onboarding tool for new team members. Furthermore, we will address risk management through actively challenging our thinking; seeking uncomfortable opposing views, constructing honest pros/cons analyses, and stress-testing assumptions before they become costly commitments. This practice reduces cognitive biases, surfaces hidden risks early, and fosters more inclusive and psychologically safe team environments where dissent is treated as a resource rather than resistance.
Finally, the session will turn to the human side of delivery. We will cover team practices that sustain velocity: integrating SME feedback loops to iterate quickly and learn early, shielding researchers from agile ceremony fatigue, celebrating every win and learning from setbacks, and staying adaptable through ambiguity and changes. We will also address the often-invisible “glue work” that sustains production excellence, presenting original survey data from Engineering, Science, and Product teams to quantify its impact and offer methods to identify common blind spots.
Target Audience:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mehdi Rezagholizadeh is a Principal Member of the Technical Committee at AMD. Before joining AMD, he was a Principal Research Scientist at Huawei Noah’s Ark Lab Canada, where he worked since 2017 and served as the leader of the Canada NLP team for over six years. His research and projects focused on deep learning and its applications in NLP, computer vision (CV), and speech processing. He has contributed to advancements in generative adversarial networks, computational NLP, and efficient solutions for training, model architecture, and inference of pre-trained models.
Mehdi holds more than 15 patents and has authored over 50 publications in leading conferences and journals, including TACL, NeurIPS, AAAI, ACL, NAACL, EMNLP, EACL, Interspeech, and ICASSP. Additionally, he has actively contributed to the academic and industrial communities by organizing prominent workshops, such as the NeurIPS Efficient Natural Language and Speech Processing (ENLSP) workshops (2021–2024), and by serving on technical committees for ACL, EMNLP, NAACL, and EACL, including as Area Chair and Senior Area Chair for NAACL 2024. Over his career, he has successfully supervised more than 20 M.Sc. and Ph.D. interns in both industrial and academic settings.
He earned his B.Sc. in 2009 and M.Sc. in 2011 from the University of Tehran and completed his Ph.D. in 2016 at McGill University in Electrical and Computer Engineering (Centre for Intelligent Machines).
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Long-context capabilities are becoming essential for modern LLM applications, from document understanding and agent workflows to code, RAG, and multi-step reasoning. But supporting long sequences efficiently is still one of the hardest practical challenges in large-scale model training and inference, especially when memory, bandwidth, and latency become the real bottlenecks rather than raw compute alone. In this talk, I will discuss the key systems and modeling considerations for long-context training and inference on AMD GPUs. I will cover the main sources of cost in long-sequence workloads, including KV-cache growth, attention complexity, memory movement, parallelism strategies, and kernel efficiency. I will also discuss practical approaches to make long-context workloads feasible in production and research settings, including efficient attention variants, context extension strategies, distributed training design, cache optimization, precision choices, and inference-time serving trade-offs. The talk is grounded in a practitioner-focused perspective: what actually matters when moving from promising ideas to stable, scalable implementations on AMD hardware. The goal is to provide a clear view of the design space and a practical roadmap for building efficient long-context systems on modern AMD GPU platforms.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmed Radwan is a Machine Learning Specialist at the Vector Institute, where his research sits at the intersection of multimodal AI, large language models, and responsible AI. He is the lead developer of SONIC-O1, the first open-source omnimodal benchmark for evaluating multimodal LLMs on real-world audio-video understanding, and the creator of UnBias+, a production-grade open-source toolkit for automated bias detection and debiasing in text. His broader work spans agentic system design, LLM hallucination reduction, and fairness evaluation, with research published in IEEE journals and international AI conferences. He has conducted research across the Vector Institute, York University, and KAUST.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Experiments on closed- and open-source models reveal limitations. While the performance gap in MCQ accuracy between two model families is relatively small, we observe a substantial 22.6% performance difference in temporal localization between the best performing closed-source and open-source models. Performance further degrades across demographic groups, indicating persistent disparities in model behavior. Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Anshuman Panwar is an AI leader in financial services, focused on deploying production-grade machine learning and Gen-AI in regulated environments. He leads end-to-end delivery of ML systems—from signal research and model development to governance, monitoring, and integration into business workflows—across domains including sales prospecting and investment decisioning. His work emphasizes measurable impact, auditability, and scalable operating models for enterprise AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Modern institutional sales is low-frequency and high-stakes: a small coverage team needs early, defensible signals on which allocators (pensions/endowments/insurers) are likely to be “in market” for a new mandate before an RFP appears. This session walks through a production-ready prospect ranking system that handles missing/noisy CRM data, learns intent using proxy targets under delayed ground truth, and augments structured models with controlled LLM-assisted research from unstructured sources (e.g., mandate news, personnel changes, policy updates). I’ll cover evaluation choices for top-K ranking (precision@K, stability, and leakage traps), operational handoff to human coverage, and the feedback/monitoring loop that keeps recommendations actionable over time.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hagay is Senior Vice President of AI Inference at Cerebras Systems, where he leads the development of the world’s fastest AI inference service powered by the Cerebras Wafer Scale Engine. He brings over 20 years of experience across software engineering and machine learning, with leadership roles spanning Meta AI, AWS ML, and Databricks Mosaic AI. His work focuses on large-scale AI infrastructure for training and serving state of the art AI models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most developers use LLM APIs like a black box: pick a model, send requests, and live with whatever performance they get. This talk argues that this is the wrong way to think about performance. Modern inference stacks implement optimizations such as prompt caching, speculative decoding, disaggregated inference, and more. But for those optimization gains to be maximized, the application needs to use the API in the right way. I will explain the key optimizations that matter, how they work at a high level, and what API users can do to fully benefit from them in practice. The focus is not on theory or provider internals. It is on helping practitioners get better real-world performance from the same LLM APIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Karthik is an AI Strategist at Teradata, supporting Financial Services customers in the US and Canada. As a Data Scientist and Technologist, he works to create interesting solutions that help create business outcomes for our customers. Before Teradata, Karthik worked for various startups supporting customers in forward engineering roles. He has also had several cofounding member roles in companies and currently holds several patents across various domains. Karthik lives in the Silicon Valley Bay Area.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Financial institutions collect data from countless touchpoints, including call transcripts, chatbots, branch visits, transactions, and more, yet many struggle to extract meaningful insight from these interactions. Specifically, understanding the “customer task” or the paths that lead to significant outcomes remains a persistent challenge. Solving this unlocks something valuable: a clearer view of the intent behind customer behavior, enabling better offers, smarter services, and stronger retention.
While these discrete event sequences aren’t traditional NLP, they carry their own vocabulary and context and timestamps. This talk explores how to build both white-box and deep learning transformer/generative models and walks through the tradeoffs between accuracy, explainability, and inferencing complexity. The result is a practical framework that lets businesses select the right model for the right use case, whether regulatory constraints apply or not, while still achieving the same core objective.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
As Director of Data Science at Wealthsimple, Lin Liu architects AI/ML solutions that power the future of finance. His experience includes leading AI/ML consulting engagements for AWS clients at Amazon and creating flagship fraud and credit models for Capital One Canada. A patented inventor in credit scoring, Lin specializes in building scalable AI/ML solutions that bridge the gap between data science and tangible business value.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Foundation models have achieved remarkable success in natural language and vision, but applying the same paradigm to structured, sequential event data — transactions, interactions, behavioural signals — introduces a distinct set of technical challenges that existing literature largely overlooks.
In this talk, we share what we learned building and productionizing a domain-specific foundation model trained on millions of heterogeneous event sequences. We dig into:
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Javeria Ahmed is a Senior Manager at RBC working on Retail Risk Models with a background in Computational & Applied Math with 4+ years of experience in the financial services sector. Javeria has led projects and models focusing on the intersection of risk modelling and the automotive industry and is particularly passionate about auto shopping behavior, dealer gaming and fraud and their impact in the viability of risk models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Feature importance estimation is crucial for model interpretability, but traditional permutation-based methods break down when features exhibit dependencies. Standard permutation importance shuffles features independently, creating out-of-distribution samples that don’t reflect realistic data relationships—leading to unreliable and often misleading importance scores. As warned by Hooker et al. (2021), “unrestricted permutation forces extrapolation.”
This talk introduces a conditional subgroup approach for computing model-agnostic feature importance that respects feature dependencies through row and column blocking strategies. The method combines two complementary Model-X techniques that model the joint feature distribution:
The approach uses Fraction of Variance Unexplained (FVU) as a variance-based sensitivity measure with well-defined bounds [0,1], making it comparable across problems. Unlike SHAP or standard permutation importance, this method correctly handles multicollinear features without requiring model retraining or manual feature dropping.
WHAT YOU’LL LEARN:
Applying steps (1)–(5) leads to more conservative (less exaggerated) importance scores. The maskon library implements these steps and can be easily integrated into a scikit-learn workflow.
ABOUT THE SPEAKER:
Olivier Blais is the Co-founder and VP of Artificial Intelligence at Moov AI, a Publicis Groupe company and Canadian leader in applied AI and data solutions. He has led strategic AI initiatives for over 100 organizations across the country.
Appointed in 2025 by Canada’s Minister of Artificial Intelligence, Evan Solomon, to the national AI Strategy Task Force, Olivier helps shape the country’s long-term vision for AI innovation, ethics, and competitiveness. He also serves as Co-Chair of the Government of Canada’s Advisory Council on AI, Chair of the Canadian Mirror Committee on AI at ISO/IEC, and Project Leader of the ISO standard on AI system quality and conformity assessment, driving global discussions on responsible and trustworthy AI.
A trusted advisor to major enterprises such as Air Transat, Industriel Alliance, and Pratt & Whitney, Olivier champions AI that empowers people — delivering real impact through innovation, security, and societal progress.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Everyone agrees that moving from conversational tools to autonomous, multi-agent systems is the future. But the demos lie. In production, the “Agentic Shift” is hitting a massive wall: evaluation is fundamentally broken. When engineering, legal, and product teams can’t even agree on the definition of “good,” processes fail, user trust evaporates, and compliance teams hit the brakes.
Drawing from international efforts to standardize AI system quality and conformity assessment, alongside hard lessons learned deploying autonomous agents in highly regulated industries (banking, insurance, healthcare) this session bypasses the hype to look at why AI evaluations actually fail. We will dissect the gap between human-machine teaming in theory versus reality, why logging traces isn’t enough, and how to build scenario-based evaluations that satisfy both the engineers building the agents and the governance teams protecting the enterprise.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Travis is an expert solutionizer who likes long walks in the park and tinkering with interesting technology.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Fine-tuning feels like the natural next step when your model isn’t performing — but it’s often the wrong one. Before committing to a training run, it’s worth asking: have you fully exhausted what you can achieve without touching the weights?
In this talk, we’ll break down the tradeoffs between prompt optimization and fine-tuning — when each approach earns its cost, and what the signals look like in practice. We’ll make it concrete using Weights & Biases Models and Weave, walking through a real evaluation workflow that tracks experiments, surfaces behavioral differences, and helps you measure whether a change actually moved the needle.
There’s no universal answer to which approach wins. But there is a better way to find out — and it starts with having the right evals in place before you make the call.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Tyson Macaulay is an experienced executive in cybersecurity and networking. He has worked extensively in Data Center (DC), High Performance Computing (HPC), telecommunications, and blockchain technologies. In his current role as Director and COO of 01 Quantum, Tyson guides go-to-market strategy for AI security. He is also active in the energy sector, advancing quantum-safe digital assets. Prior to 01 Quantum, Tyson was VP of Solution Architecture at Cerio, a leading performance networking company. He previously held senior roles at BAE Systems as CTO of the Cyber Security Division, CTO of Telecommunications Security at Intel, and Chief Security Strategist at Fortinet. Tyson is an active security researcher and Deputy Director of Carleton University’s National Centre for Critical Infrastructure Protection. His body of work includes books, peer-reviewed publications, international standards, and patents.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
This session analyzes trade-offs between AI encryption overhead and latency using open source Fully Homomorphic Encryption (FHE) and model optimizations. Presenting AI penetration tests and performance data from the NC-CIPSeR Substrate Lab at the Carleton University, we demonstrate how optimized FHE orchestration enables secure and performant AI deployment. Learn to recognize the strategy and specific use-cases for prompt and model encryption of expert AIs.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Zeke Miller is Director of Engineering for Agent Factory at Workday, where he combines deep expertise in programming language theory, code health, and AI to build production-grade agentic systems for data and software engineering. Previously a Staff Software Engineer at Google, he was the Uber Tech Lead for Gemini for Data, leading efforts on Code Assist, Conversational Analytics, Data Science Agents, NL2SQL, and LookML generation in Google Cloud. Zeke’s background spans Code AI in Google Labs and privacy-centric systems in Ads Privacy Sandbox, grounded in a Computer Science degree from the Rochester Institute of Technology, giving him a uniquely practical perspective on how LLMs and agents are transforming the software and data stack at scale.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most enterprise AI architectures put the “intelligence” in a thick agent layer: elaborate tool graphs, custom planners, and hand‑rolled orchestration logic. That feels robust—until the next model, framework, or interaction pattern ships and you’re stuck rewriting the whole thing. In this talk, Zeke Miller shares how Workday’s Agent Factory team flipped that mindset: treating agents as a thin, rewritable veneer on top of a thick foundation of systems of record, systems of action, and governance. Using real examples from conversational analytics and HR/finance workflows, he’ll show how a single well‑designed connector plus strong evals outperformed months of bespoke agent engineering, how they use evals as an executable contract between users and systems, and how they bake security and auditability in from day one. Attendees leave with a concrete mental model and patterns for building agentic systems that can change quickly without breaking the parts of the business that can’t.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Alet Blanken is Vice President of AI Engineering at Workday, where she leads the strategy, development, and deployment of Generative AI solutions that transform analytics across Looker, BigQuery, and large-scale databases. With over 15 years of experience building and leading high-performing engineering teams at Google Cloud, Amazon Web Services, and ACI Worldwide, she operates at the intersection of Generative AI and data analytics to deliver scalable, secure, and production-ready systems. Her work spans LLMs, retrieval-augmented generation (RAG), anomaly detection, and predictive modeling to unlock actionable insights and automate complex analytical workflows. Alet holds degrees in Information Technology and Industrial Psychology, along with a PMP and AWS Solutions Architect certifications, and brings a rare blend of deep technical expertise and human-centered leadership to the TMLS stage.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Every software company claims to be becoming an AI company. In practice, most are re‑running the wrong playbook: treating AI like another infrastructure migration instead of the current shift in how products are designed, shipped, and operated. In this talk, Alet Blanken, VP of AI Engineering at Workday, shares a practitioner’s playbook for that transition, grounded in Workday’s journey building agentic systems in HR and finance at scale. She’ll cover why AI is not analogous to on‑prem → SaaS, how product design must start from the first demo and traffic patterns, and why code is now the cheapest part of the stack. Attendees will see how Workday structures its architecture around durable systems of record and action, fast iteration loops on real usage data, and a culture that treats reliability, latency, and trust as first‑class metrics. The goal is to leave with a realistic picture of what it takes for a software company to truly operate as an AI company.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Swanand Gupte is a seasoned Artificial Intelligence executive and strategist dedicated to navigating the intersection of advanced analytics and business transformation. Currently leading key AI initiatives at TELUS, Swanand focuses on modernizing enterprise capabilities through the deployment of next-generation technologies, including Agentic AI and MLOps. He is passionate about building high-performance teams and creating scalable architectures that translate complex data into best-in-class customer experiences.
With a professional foundation rooted in management consulting at McKinsey & Company, Swanand brings a disciplined, global perspective to driving innovation and operational efficiency. His expertise lies in bridging the gap between technical complexity and executive strategy, ensuring that AI investments deliver measurable value and sustainable growth. Swanand holds an MBA from the University of Chicago Booth School of Business
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI transitions from experimental PoCs to core business infrastructure, the primary bottleneck has shifted from technical feasibility to organizational adoption. This session explores the dual-track strategy TELUS uses to drive meaningful business outcomes at scale. We will break down the “”Bottom-Up”” approach—democratizing AI through self-serve LLM sandboxes and employee enablement—and the “”Top-Down”” approach—leveraging a specialized AI Accelerator to solve high-impact, complex business problems.
Attendees will learn how Telus integrates Sovereign AI into its roadmap and why the modern AI leader must pivot focus from the “”How”” (technical build) to the “”What”” (problem selection and change management) to bridge the “value gap” in enterprise AI.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ramin is a Machine Learning Engineer at TELUS with over seven years of experience. His focus areas include NLP, AI-driven automation, and building ML systems that operate reliably at scale. When not working, he enjoys hiking local trails and spending time at the gym — especially during Vancouver’s frequent rainy days.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Detecting anomalies across thousands of high-dimensional time-series streams is hard; explaining them to the engineer who has to act is harder. In this session I’ll walk through the system we built and deployed at TELUS to close the gap end-to-end: a hybrid multi-head LSTM that trains a dedicated encoder per KPI, paired with an adaptive threshold that scales by hour-of-day to suppress the false positives static thresholds produce during predictable daily cycles.
Detection is half the work. The session’s second half is the explainability and resolution layer: we isolate the sector counters and individual cells driving each anomaly, an LLM turns those signals into a diagnosed cause and a ranked list of recommended actions from a YAML-defined playbook, and an outcome store records what the engineer actually did and whether the KPI recovered. Those outcomes feed back as Wilson-smoothed action win-rates that re-rank future recommendations — closing the learning loop.
The architecture generalizes to any domain where rare events hide in correlated time-series — fraud, observability, IoT.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Kai Wei Tan is a Senior Forward Deployed Engineer at CoreWeave, where he partners closely with enterprise customers to design and deploy production-grade AI systems. Previously a Lead AI Software Engineer at Boston Consulting Group, he built and scaled generative AI solutions for Fortune 100 companies, leading end-to-end development of LLM-powered agents and real-time decisioning systems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Getting LLMs to reliably call tools in production is not just a prompting problem but also a training problem but most practitioners lack a principled way to measure progress. This talk uses tau2-bench, a rigorous tool-calling benchmark, as the backbone for a complete fine-tuning workflow: generating training scenarios, running supervised fine-tuning, and applying reinforcement learning to push past the ceiling of imitation. The result is a model measurably better at domain-specific tool use with concrete before/after numbers. Practitioners leave with a reusable approach: use a structured benchmark to drive your fine-tuning loop, not just to evaluate at the end.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Areeb Khawaja is a Technical Product Manager at TELUS. He works at the intersection of AI, APIs, data products, and platform strategy, where he’s currently leading the development of the TELUS API Marketplace. His focus is on enabling partners and developers to securely access and monetize data capabilities, while building scalable, privacy-conscious digital ecosystems.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As enterprises race to build AI-powered products and services, APIs are becoming more than technical interfaces. They are the foundation for new business models, partner ecosystems, and scalable digital distribution. But turning APIs into trusted, monetizable products inside a large enterprise is not just a technical challenge. It requires alignment across product strategy, privacy, governance, architecture, commercial models, and partner onboarding.
In this session, we will share lessons from helping take the TELUS API Marketplace from inception to launch, with a focus on the real-world decisions required to operationalize external-facing APIs in a complex enterprise environment. The talk will explore how we approached marketplace design, organization vetting, consent and trust considerations, commercialization, and the challenge of making technical capabilities usable and valuable for partners.
Rather than presenting a marketplace as a storefront alone, this session will frame it as a trust architecture for the AI economy: a system that must make access safe, scalable, and commercially viable. Attendees will leave with a practical playbook for evaluating which enterprise capabilities can become API products, how to design governance into the product from day one, and how to bridge the gap between technical possibility and market adoption.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ketan Umare is Co-Founder and CEO of Union.ai, an AI development infrastructure company helping organizations build, deploy, and scale production AI. Union.ai provides a single platform that unifies infra-aware orchestration, model training, inference, and compliance, enabling teams to escape pilot purgatory and ship AI faster.
Ketan is also a leading contributor to Flyte, the open-source, Kubernetes-native AI/ML orchestrator used by 3,500+ companies. He led the original engineering team behind Flyte, building it to power dynamic, large-scale, and fault-tolerant AI workflows. Today, Union builds on that foundation to help enterprises operationalize mission-critical AI systems with lower costs, faster iteration cycles, and production-grade reliability.
Prior to founding Union, Ketan held senior engineering leadership roles at Amazon, Oracle, and Lyft, where he worked on large-scale distributed systems and data platforms.
In his spare time, he enjoys spending time with his two daughters and exploring the outdoors.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Agent demos are easy; durable production agents are not. While tools like Claude Code and OpenClaw simplify prototyping, teams still need to manage the code, context, tools, and infrastructure that make agents work in real environments. This talk breaks down the orchestration stack behind production agents: how to make them observable, debuggable, and durable, and how to design for recovery when failures happen across reasoning, tool use, networking, and execution. Drawing from real-world engineering experience, the session will outline practical patterns for building self-healing agent systems that can operate reliably in production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.
We propose a multi-stage evaluation framework developed during a production model transition from a dense Model A to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.
The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Abhimanyu is a Senior Data Scientist at Elastic, where he works on the development and evaluation of enterprise-grade AI agents. He holds an M.Sc. in Big Data Analytics from Trent University, specializing in natural language processing.
Throughout his career, he has designed and deployed robust AI solutions across a range of industries, including social media, e-commerce, and metals and mining.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Your agent eval says accuracy improved. But did latency spike? Does your LLM-based metric even agree with human judgment? And is that 5% gain real or noise? Do we ship it or not?
If you’re evaluating AI agents, you’ve likely encountered hidden failures such as:
In this session, I’ll walk through how we addressed these at Elastic. Using a real experiment as an example, I’ll cover the evaluation setup we built to catch these failures. This includes multi-metric evaluation to expose tradeoffs (accuracy, tool usage, and latency) and a claim-level correctness evaluator (we developed in house) validated against human judgment to ensure LLM-based scores are meaningful. I’ll also discuss key significance testing principles we used to filter out noise and verify real gains.
Along the way, I’ll show the prompt structure behind our evaluator and examples of practical results.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Deepkamal Kaur Gill is a Senior Applied AI Scientist at Vanguard, where she builds production-grade LLM systems for high-stakes financial applications. Her work spans data generation, post-training, and evaluation, with a focus on building reliable, low-latency AI systems under real-world constraints.
Deepkamal holds a Master’s in Computer Science from the University of Toronto and is an active contributor to the AI community through research, mentorship, and initiatives supporting women in technology. At TMLS, she brings a practitioner’s perspective on what it truly takes to scale LLMs in production.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
While recent advances in LLMs emphasize improved model capabilities, many systems fail to scale in real-world production settings. Beyond a certain point, adding GPUs or data yields diminishing returns: training stops scaling efficiently, hardware remains underutilized, and inference latency is dominated by system constraints rather than compute. These failures are often silent, poorly documented, and difficult to diagnose in distributed environments.
In this talk, we share lessons from building enterprise-scale domain LLM systems, focusing on the system-level bottlenecks that limit scaling in practice. We examine failure modes across distributed training and inference—including communication overhead, pipeline imbalance, numerical instability during training as well as memory-bound decoding, KV cache growth, and throughput–latency tradeoffs at inference—and show how they manifest in production systems.
Rather than introducing new modeling techniques, this session presents a practical, symptom-driven approach to debugging: identifying failure patterns, tracing their root causes, and applying targeted mitigations. The key takeaway is that scaling LLMs is fundamentally a systems problem, and attendees will leave with a concrete framework to diagnose bottlenecks and make better design decisions when moving from prototype to production.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Afsaneh Fazly holds a PhD in AI from the University of Toronto and brings over two decades of experience advancing intelligent systems across academia, industry, and startups. She currently serves as AI Research Director at RBC Borealis. Her work spans foundation models, language and multimodal intelligence, and applied machine learning, with a focus on translating research into real-world impact. She has contributed extensively to the AI research community through publications and patents, and has led and mentored large multidisciplinary teams of engineers, scientists, and researchers. Her career reflects a consistent ability to bridge scientific rigor with large-scale system design and deployment.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
For the past two decades, enterprise software has largely followed the SaaS model: applications organize work through dashboards, forms, and APIs, while humans interpret information and coordinate tasks across multiple tools. Advances in large language models and agentic systems are beginning to change that structure. AI agents can now interpret user intent, retrieve context across systems, call tools, and execute multi-step workflows. As a result, software is gradually shifting from static interfaces toward systems that can actively perform work.
This shift does not mean SaaS disappears. Instead, applications increasingly become execution layers that agents interact with, while reasoning and workflow orchestration move into an agentic control layer above them. In legal workflows, for example, an agent can review large volumes of contracts, extract key clauses, compare them to internal policies, and surface potential risks for human review. In construction and engineering projects, agents can analyze plans, specifications, and contracts together to identify inconsistencies or obligations before they become costly issues in the field. The central question is therefore not whether AI replaces SaaS, but where durable advantage moves when software systems can reason over context and dynamically orchestrate work.
This talk explores how the emerging agentic stack is reshaping the software landscape and what it means for organizations building or deploying AI-driven products. It is intended for founders, product leaders, engineers, and executives who want to understand how AI agents are likely to transform software design, product strategy, and competitive advantage.
WHAT YOU’LL LEARN:
Several practical lessons emerged from the work that led to this talk.
First, organizations should start with workflows rather than models. The greatest value often comes from augmenting complex, multi-step processes rather than introducing isolated AI features.
Second, agentic systems are most effective when built on top of existing infrastructure. Rather than replacing current systems, AI can orchestrate tasks across them, allowing organizations to unlock value without rebuilding their entire stack.
Finally, a strong understanding of the science behind LLMs and agentic frameworks helps leaders make better architectural decisions. Understanding how models reason, retrieve context, and interact with tools makes it easier to design systems that are reliable, scalable, and aligned with real business needs.
ABOUT THE SPEAKER:
Abhinav Arun is a Senior AI Research Scientist at Domyn, where he leads the development of advanced AI systems and large-scale Knowledge Graphs for the financial domain. His work spans multi-agent orchestration pipelines, Knowledge Graph-grounded reasoning, and LLM powered systems for complex financial analytics. He leads research efforts behind the FinReflectKG (one of the largest open source financial knowledge graphs) ecosystem – covering financial multi-hop reasoning, graph-linked causal analysis and question answering, evaluation frameworks, and semantic alignment pipeline – with multiple accepted papers at venues including NeurIPS and ICAIF.
With a strong focus on building responsible and explainable AI systems, Abhinav’s work pushes LLMs to reason more like real financial analysts – grounded in structured, interconnected evidence across filings, vendors, and time. He is passionate about bridging cutting-edge AI with real-world finance, building systems that are explainable, scalable, and analyst-centric.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Multi-hop question answering over financial disclosures is often constrained more by evidence retrieval than by reasoning capability. Relevant facts are dispersed across filings, fiscal periods, and peer firms, and noisy long-context inputs degrade reliability even for strong reasoning models. We introduce FinReflectKG–MultiHop, a large-scale benchmark grounded in a temporally indexed financial knowledge graph derived from SEC 10-K filings, containing multi-hop QA pairs spanning intra-document, inter-year, and cross-company reasoning regimes. Questions are generated from statistically informed 2–3 hop typed motifs and paired with provenance-linked evidence to enable controlled evaluation of evidence structure. Across multiple open-weight reasoning LLMs and six structured evidence protocols, KG-linked provenance consistently improves correctness while reducing token usage by over 70% relative to text-window and semantic retrieval baselines. Our results demonstrate that reasoning reliability in finance is fundamentally governed by evidence composition and structure, and that model scaling alone cannot compensate for poorly organized retrieval contexts.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Luis Ticas is a Toronto-based AI and data science leader with over a decade of experience turning complex data into real business outcomes across life sciences, finance, and insurance. He specializes in enterprise-wide AI adoption, working across domains like call centers, CRM, R&D, and operations, and is known for bridging the gap between strategy and hands-on execution. As a Certified Generative AI Engineer with credentials across Databricks, AWS, Azure, and GCP, he brings a multi-cloud, full-stack perspective to modern AI. Luis is also the AI Lead for Climate Resilient Communities, a nonprofit he architected and built, where he leads initiatives that make climate knowledge accessible through multilingual AI systems and community-driven tools.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Sprout is a Toronto nonprofit operating as an AI- and data-first organization that works alongside urban Canadian communities and grassroots organizations, providing data tools, local insights, and storytelling support that reflects community resilience building work already happening on the ground. With a core team of three, we built and run the Multilingual Climate Chatbot (MLCC) — a production RAG system serving 200+ languages to communities that don’t speak English or French as their first language across Toronto. We did it on a near-zero budget, under real infrastructure constraints, with no option to optimize later. And we did it without compromising on our values: every model and infrastructure choice was evaluated not just on cost and quality, but on environmental footprint. This talk covers four things: team design as architecture, responsible model selection under constraint, what broke in production, and a retrieval architecture built from failure. Every design decision traces back to a constraint the team couldn’t ignore: budget, latency, language quality, team size, or environmental responsibility. Infrastructure cost: $26/month. Languages served: 200+. Team size: 3.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David is a Senior AI/ML Engineer within the Office of the CTO at NetApp, where he’s dedicated to empowering developers to build, scale, and deploy AI/ML solutions in production environments. He brings deep expertise in building and training models for applications such as NLP, vision, real-time analytics, and even classifying debilitating diseases. His mission is to help users build, train, and deploy AI models efficiently, making advanced machine learning accessible to users of all levels.
Before NetApp, he was heavily involved in the AI/ML community, specifically in conversational AI solutions and driving AI platform growth in a DevRel and pre-sales role. David frequently shares his insights at industry conferences and events, offering hands-on guidance for implementing AI/ML in cloud environments. David’s prior experience includes contributing to the Kubernetes and CNCF ecosystems, working hands-on with VMware virtualization, implementing backup/recovery solutions, and developing hardware storage adapter firmware and drivers.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most RAG discussions start and end with vector embeddings. That makes sense because vector search is approachable, fast to prototype, and widely supported. But semantic similarity is not the same thing as answer retrieval. When teams rely on embeddings as the default for every use case, they often end up with systems that sound convincing while returning weak, incomplete, or confidently incorrect answers. This talk reframes retrieval as the real design decision in RAG, not a backend detail.
We will walk through the major retrieval options at a high level, including vector, graph, and BM25 approaches, and explain where each one fits. Then we will show why hybrid designs, such as Vector + Graph and Vector + BM25, often produce stronger results by combining semantic context with stronger grounding and greater precision. The goal is to give AI engineers a practical mental model for choosing a retrieval approach based on the shape of their data and the kinds of answers they need, rather than defaulting to embeddings because everyone else did.
WHAT YOU’LL LEARN:
Too many RAG systems are built around a single assumption: use vector embeddings and figure out the rest later. That works until the answers need to be correct. This session shows AI engineers how retrieval choice drives answer quality, why vector search alone often leads to confidently wrong outputs, and how graph, BM25, SQL, and hybrid retrieval patterns can produce better, more grounded results. It is a practical talk for builders who want to move past the default RAG recipe and design systems that answer with more precision and less guesswork.
ABOUT THE SPEAKER:
Nima holds a Ph.D. in Systems and Industrial Engineering with a strong foundation in Applied Mathematics. He completed a postdoctoral fellowship at the C-MORE Lab (Center for Maintenance Optimization & Reliability Engineering) at the University of Toronto, where he worked on machine learning and operations research (ML/OR) projects in close collaboration with industry and service-sector partners.
He was part of the Maintenance Support and Planning Department at Bombardier Aerospace, applying ML/OR methodologies to reliability and survival analysis, maintenance optimization, and airline operations planning.
Nima is currently a Senior Data Scientist within the Corporate Functions Analytics team at Scotiabank in Toronto, Canada. His research and applied work span machine learning, optimization, and large-scale decision-making systems. He has authored over 40 peer-reviewed journal articles and book chapters in leading venues and holds one granted patent. His work has been featured at major machine learning and AI conferences, including NeurIPS, ICML, NVIDIA GTC, GRAPH+AI, and TMLS.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Causal discovery algorithms infer directed acyclic graphs (DAGs) from observational data but are highly sensitive to hyperparameters, structural constraints, and assumptions that are difficult to identify from data alone. We propose an LLM-guided causal discovery framework in which a large language model (LLM) acts as a domain-aware expert to inform the calibration of causal models. The LLM encodes prior knowledge about plausible causal directions, temporal ordering, lag structures, and exclusion constraints, which are translated into structured priors and tuning parameters for time-lagged causal discovery algorithms.
We apply the proposed approach to macroeconomic systems, where variables exhibit delayed and interdependent causal relationships. Empirical results show that LLM-guided calibration yields more stable and interpretable causal graphs and improves out-of-sample prediction of macroeconomic indicators compared to purely data-driven baselines. This work demonstrates how LLMs can bridge expert knowledge and statistical causal inference in complex dynamical systems.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Ahmad Pesaranghader is an Applied AI Scientist at CIBC, where he focuses on LLM safety. He holds a Ph.D. in Computer Science from Dalhousie University with a background in Machine Learning and Big Data, and has worked across academia and industry with text, image, and biomedical data.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Large Language Models (LLMs) and Large Reasoning Models (LRMs) hold transformative potential for high-stakes domains such as finance and law, but their tendency to hallucinate poses a critical reliability risk. This talk explores detection strategies, including uncertainty estimation, reasoning consistency checks, and factual validation, paired with mitigation approaches such as knowledge grounding, prompt engineering, and confidence calibration. What sets this discussion apart is the emphasis on root cause awareness: by categorizing hallucination sources into model, data, and context-related factors, detection and mitigation strategies can be precisely matched to the underlying cause rather than applied as generic fixes.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Himanshu Joshi is the Founder and CEO of COHUMAIN Labs, where he leads initiatives on collective intelligence between humans and machines. He spearheaded the development of SAFEALIGN AI, a platform focused on the safe, secure, and aligned deployment of agentic AI in enterprises. Previously, he served as a Team Lead for the AI Projects at the Vector Institute for Artificial Intelligence, driving responsible AI initiatives that generated over $170 million in documented enterprise value across Fortune 500 organizations.
An internationally recognized thought leader in agentic AI, governance, and AI security, Himanshu has authored books and LinkedIn Learning courses on AI adoption and published 15+ peer-reviewed papers at leading venues including NeurIPS, ICLR, IEEE, AAAI, and ICDM. He serves as Program Chair for the AAAI 2026 AI Governance Workshop and contributes as a track lead and reviewer across major global AI conferences. He is also the recipient of the AI Ally of the Year 2025 (North America): Special Jury Award.
He completed the EPGM at the MIT Sloan School of Management and is pursuing doctoral research on human-AI collective intelligence, alongside an M.S. in Artificial Intelligence at the University of Texas at Austin. He also holds double M.S. degrees in Technology and Strategy.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Dr. Shukla brings over a decade of experience in Operations Research, Statistics, and AI. Her work in Generative AI has significantly transformed how industries such as healthcare, cybersecurity, and infrastructure utilize advanced technology by bridging the gap between research, practice and deployment at scale. In addition to her technical expertise and numerous publications in esteemed journals, Dr. Shukla advocates for AI Security and Responsible AI.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Enterprise deployment of autonomous multi-agent systems (MAS) has surged, yet existing governance frameworks designed for traditional software or single-agent systems prove inadequate for managing emergent behaviors, coordination vulnerabilities, and distributed agency. We introduce \textbf{meta-governance}, by means of SafeAlign AI Governance and Responsible AI OS via the use of specialized intelligent agents to monitor and control operational agent fleets, as a scalable paradigm for achieving comprehensive Safety, Alignment, Governance, and Security (SAGS) in production MAS deployments. Through analysis of regulatory requirements (EU AI Act, NIST AI RMF, Singapore Framework), documented failure modes, and novel attack vectors, including inter-agent trust exploitation, we establish design principles for production-grade MAS governance systems. We validate these principles through deployment scenarios in regulated industries (financial services, healthcare, and pharmaceuticals), managing 100+ operational agents, demonstrating that meta-governance can achieve sub-second intervention latency, 100 % safety-critical policy compliance, and automated decision handling while maintaining comprehensive audit trails. Our framework addresses the fundamental asymmetry between attack propagation speed and human oversight capacity, enabling enterprises to deploy autonomous agents at scale with regulatory compliance and risk mitigation.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this workshop we’ll stand up NemoClaw end to end: install the reference stack, get OpenClaw running inside the OpenShell sandbox, configure inference routing, and lock down a network policy that survives a multi-hour agent session. We’ll walk through the blueprint, the CLI, and the approval flow, then run a real long-lived agent against it and break things on purpose so you know what the layers actually catch.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Mendelsohn is a Solutions Architect at Alation specializing in Applied AI, where he works at the intersection of product, engineering, and go-to-market strategy for agentic AI and data intelligence solutions. With over a decade of experience in data and AI, his career spans data engineering, data warehousing, consulting, and a tenure at Databricks before joining Alation
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Most large AI organizations have a data problem that isn’t what they think it is. The problem isn’t missing data — it’s data that exists, was built deliberately, is maintained by real people, and still isn’t being reused. It sits in a pipeline nobody else can find.
This talk examines what happens when a large-scale AI organization shifts from a project-centric to a product-centric operating model — and discovers that building data products doesn’t automatically mean anyone benefits from them. Across a portfolio of more than 140 data products supporting five distinct AI strategies, the patterns are consistent: data scientists rebuild ingestion pipelines for data that already exists, reusable frameworks go undiscovered by the teams who need them most, and the inventory grows while the acceleration effect of reuse does not.
This is not a clean success story. We’ll walk through what the product-centric model solved, what it didn’t, and what the discovery and trust gap actually costs — in terms practitioners recognize: the project that started from scratch because no one knew a shared datamart existed; the parser framework that sat idle while three teams built their own. We’ll then cover what it takes to close the gap: building a searchable, unified context layer that makes data products findable, evaluable, and reusable without requiring every team to know what every other team built.
Practitioners will leave with a diagnostic framework: how to distinguish a discovery problem from a data quality problem, the leading indicators that your organization has an invisible asset layer, and the ordering of interventions that helps — starting with discoverability, then trust signals, then governance.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
As AI models become more capable, the focus in production environments is shifting from full automation to effective human-AI collaboration. How should humans and AI systems work together on complex tasks that neither can optimally handle alone? This 1.5-hour workshop explores the ML foundations of collaborative intelligence. We will cover concrete production patterns for three areas: determining when models should defer to humans, communicating confidence and uncertainty, and utilizing training paradigms that produce models that are better collaborators (not just better predictors). I will ground these concepts in real-world production AI use-cases.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Nataliya Portman, Ph.D., is the Lead Data Scientist at CBC/Radio-Canada, where she builds intelligent systems to connect Canadians with meaningful content. With a career spanning neuroscience, biotech, automotive, and digital media, Nataliya leverages her expertise in advanced statistics, machine learning and AI to drive actionable business results. A University of Waterloo Applied Mathematics Ph.D. and former postdoctoral researcher at the Montreal Neurological Institute, Nataliya remains a dedicated advocate for math education.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
In this talk, I will showcase AI system that seamlessly integrates into our CDP platform and performs classification of users into categories according to their interests. I will focus on the GenAI prompt that acts as a high-precision reasoning engine uncovering consistent interaction patterns even when a user’s true interest is buried under other content. The prompt’s true value lies in its ability to find genuine enthusiasts who don’t engage through traditional channels like email – allowing us to reach out through a different channel.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
David Rosenberg leads the Machine Learning Strategy team in the Office of the CTO at Bloomberg. He was a co-author of the BloombergGPT research paper, which explored what it would take to build a large language model tailored to the financial domain. He was previously an adjunct associate professor at NYU’s Center for Data Science, where he twice received the “Professor of the Year” award. Before joining Bloomberg, David served as Chief Scientist at Sense Networks, a location data analytics and mobile advertising company. He holds a Ph.D. in statistics from UC Berkeley, an S.M. in applied mathematics from Harvard University, and a B.S. in mathematics from Yale University.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Reinforcement Learning for Large Language Models: A Modern View is a 3-hour tutorial for the Toronto Machine Learning Summit. It provides a motivation-first, mathematically rigorous introduction to reinforcement-learning-style post-training for large language models, aimed at machine learning researchers and advanced students who want a principled view of the methods behind modern LLM alignment and adaptation.
The tutorial starts with a brief overview of the contemporary LLM post-training pipeline and then develops the policy-gradient foundations needed to understand these methods from first principles. Instead of treating the field as a sequence of named algorithms, it organizes the material around the major design dimensions that distinguish practical approaches: how the training signal is obtained, how variance is reduced, how policy drift is controlled, how KL regularization is imposed and estimated, and how credit is assigned within a completion. Throughout, the tutorial emphasizes the tradeoffs encoded by these choices and distinguishes clearly between mathematically established results, theory-motivated arguments, practical heuristics, and empirical findings.
This perspective is then used to place methods such as REINFORCE, PPO-style RLHF, DPO, RLOO, and GRPO in a common mathematical framework, and to connect them to published descriptions of post-training in recent frontier models. Participants will leave with a unified understanding of the foundations of LLM post-training, the main ideas behind current methods, and the research questions that remain open.
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
Michael Havey is a data architect with thirty years of experience in graph databases, generative AI, data integration, application integration, and business process management. Michael is the author of two books and numerous articles on software design topics.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
An agent has a flow, and getting the flow right is critical. We can trust the agent’s result only if the path the agent took to get there aligns with our architectural intent. For years, BPM practitioners have faced this exact challenge with production workflows.
Most agent tools provide observability traces of the agent’s execution. This flow log gives useful raw data, but it would be advantageous to bring that data together to give us a picture of the path the agent usually takes. We borrow from BPM an algorithm called Process Mining, which uses the log to reconstruct the actual process flow. We can then compare that to the process flow we intended. Is the actual flow close enough or is it way off? Are there inefficiencies, such as superfluous tool executions, that we can try to reduce? Can we trim the flow to save cost and reduce latency?
I present results from an agent I built on AWS’s AgentCore service.
WHAT YOU’LL LEARN:
First, the recognition that agents are processes. Designing the process right is crucial.
Next, production-grade agents need observability. The agent publishes a raw trace, but there are proven algorithms, notably Process Mining, that can analyze and measure the overall process.
Finally, results from Process Mining help us compare the process we intended with the one that actually executes! This helps us determine whether we need to redesign the agent or just optimize it.
ABOUT THE SPEAKER:
Syed Shariyar Murtaza is an AI leader and innovator specializing in applied machine learning for life insurance, enterprise workflows, and intelligent systems. As an AVP of AI at Manulife Financial, he focuses on building real-world agentic AI solutions that transform business operations and decision-making. His work spans converting complex domain knowledge into executable workflows, designing advanced LLM-based underwriting systems, and creating benchmark datasets for evaluating AI in regulated environments. Shariyar holds a Ph.D. in Computer Science from the University of Western Ontario and is also an adjunct faculty member at Toronto Metropolitan University, where he teaches natural language processing and data science.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
Retrieval-augmented generation (RAG) enables generative AI models to extract accurate facts from external unstructured data sources. For structured data, RAG can be further enhanced with function (tool) calls to query databases. This paper presents an industrial case study of implementing RAG in the call center of a large financial institution. The study outlines the architecture and practical lessons learned from building a scalable RAG deployment. It also introduces enhancements for retrieving facts from structured data sources using data embeddings, achieving both low latency and high reliability. Our optimized production application demonstrates an average response time of only 7.33 seconds. Additionally, the paper compares various open‑source and closed‑source models for answer generation in an industrial environment. This paper is published in the prestigious NAACL (North American Chapter of the Association for Computational Linguistics) conference: https://aclanthology.org/2025.naacl-industry.48/
WHAT YOU’LL LEARN:
ABOUT THE SPEAKER:
I’m currently a Staff Research Scientist on the Globalization team at Netflix focused on multimodal LLMs. The Globalization team removes language barriers from the Netflix experience as we deliver movies and TV shows to 300+ million members across 190+ countries in 30+ languages. We are responsible for the translation and cultural adaptation of all aspects of member interaction on Netflix (e.g., subtitles and dubbing).
I earned my Ph.D. in Computer Science from Rice University in Houston, TX. My research interests are related to math and machine/deep learning, including non-convex optimization, theoretically-grounded algorithms for deep learning, continual learning, and practical tricks for building better systems with neural networks.
TALK TITLE:
TRACK:
SUB TOPIC:
ABSTRACT:
WHAT YOU’LL LEARN:
TMLS is Canada’s flagship summit for applied ML, AI infrastructure, and enterprise adoption. We bring together the researchers, practitioners, and leaders putting AI into practice across Canada. If you have real lessons, practical wins, or important research to share, we’d love to hear from you.
We’re looking for talks grounded in real work, from production systems and implementation challenges to research that helps the community understand what matters now and what comes next.
Business Leaders: C-Level Executives, Project Managers, and Product Owners will get to explore best practices, methodologies, principles, and practices for achieving ROI.
Engineers, Researchers, Data Practitioners: Will get a better understanding of the challenges, solutions, and ideas being offered via breakouts & workshops on Natural Language Processing, Neural Nets, Reinforcement Learning, Generative Adversarial Networks (GANs), Evolution Strategies, AutoML, and more.
Job Seekers: Will have the opportunity to network virtually and meet over 30+ Top Al Companies.
Ignite what is an Ignite Talk?
Ignite is an innovative and fast-paced style used to deliver a concise presentation.
During an Ignite Talk, presenters discuss their research using 20 image-centric slides which automatically advance every 15 seconds.
The result is a fun and engaging five-minute presentation.
You can see all our speakers and full agenda here