The Biggest Constraint Facing the TMLS 2026 Committee, and What It Reveals About Evals (Pt. 1.)


Metrics aren’t wrong. They’re incomplete.

Across every domain discussed, similar patterns arose. Teams are not working with metrics that give them bad information but metrics that give them partial information, and the missing part is usually the part that matters for the decision.

Fraud detection metrics exist but the ground truth on what is not fraud arrives too late to be operationally useful. Human-in-the-loop metrics capture how often a human overrides the model, but not whether that override was actually better. GPU utilization shows allocation, not productive use. Call deflection shows that fewer interactions reach a human, but not whether the customer’s issue was resolved. Recall looks strong on paper but does not reliably describe what is happening downstream.

Lesson from Meeting? The consistent lesson is that production metrics tend to measure something adjacent to the actual outcome. They measure activity, not effect. The gap between the two is where bad decisions happen.


Hallucination and subjective tasks resist stable measurement

Hallucination rates are a moving target. There is no stable definition that holds across teams, domains, or time. The metric shifts with the task, and attempts to pin it down tend to produce numbers that look precise but aren’t reliable. One observation raised in the session compared this to the Heisenberg uncertainty principle: the act of trying to observe a dynamic system too closely changes the system’s behaviour, particularly in reasoning traces.

The same problem applies to any task that involves subjective judgment. When the correct answer depends on tone, inflection, interpretation, or context, even human evaluators cannot agree. People try to objectify these assessments into quantified scores, but the result is a metric that gives the appearance of rigor, but still lacks substance. This is not a gap that better tooling will close. It is a property of the task.


Proxy strategies are the operating reality, not a fallback

When the ideal metric is not available, teams do not stop making decisions. They find something else to lean on. The more useful question is not “do you have the right metric” but “do you know what your proxy is actually measuring, and what it is missing?”

The proxy strategies that held up best in the discussion shared a few properties. They were deliberate, not accidental. They had known limits. And they were tied to a feedback loop that continuously updated the proxy over time.

Golden datasets combined with adversarial stress tests (prompt injection, corrupted inputs, edge cases) were the most commonly referenced approach. These are not perfect, and they go stale, but they provide a stable reference point when live metrics are noisy. The important thing is that failures from these tests get looped back into training data, RAG pipelines, and data strategy, closing the loop rather than just flagging problems.

Maintaining a rules-based or decision-tree baseline was raised by multiple participants. If the model cannot beat a simple baseline, that is a signal worth paying attention to. This prevents the common failure; using a fancy model that is actually worse than what came before.For tasks where multiple valid answers exist, teams are moving toward soft correctness: hierarchical scoring and degrees of right, rather than binary pass/fail. This is especially relevant for classification at different levels of a hierarchy, where several answers can be technically correct but at different levels of specificity. Binary evaluation on non-binary tasks produces misleading results.

These are 3 of the 10 takeaways noted in our meeting. In the next part, we will discuss;


–  How Grounding Works better than scoring
– Correlation is not explanation, and models are good at pretending
– Speed and efficiency metrics can mask weak systems


Why this matters for TMLS

Our TMLS conference is built for Canadian AI practitioners, researchers, and leaders working in AI. Discussions like this shape the work, and they shape the kinds of conversations that are useful in the room.


How this shaped the 2026 program

This year’s program is organized across the three TMLS content categories.

  • Technical / Engineering: hands-on ML and GenAI implementations, agentic systems and coding
  • Business / Executive / Product Strategy: Making sure value is secured across applications,
  • Fundamental Research: Cutting-edge developments, novel evaluation methods and creative approaches to agents

This wraps up part 1 of our discussions on evals in the face of uncertain metrics. If you’d like to get involved in the committee or have something to share on stage, please let us know at info@torontomachinelearning.com

Table of Contents

Who Attends

Attendees
0 +
Data Practitioners
0 %
Researchers/Academics
0 %
Business Leaders
0 %

2023 Event Demographics

Technical practitioners working directly with ML/AI systems
0 %
Currently Working in Industry*
0 %
Attendees Looking for Solutions
0 %
Currently Hiring
0 %
Attendees Actively Job-Searching
0 %

2023 Technical Background

Expert/Researcher
14%
Advanced
37%
Intermediate
28%
Beginner
7%

2023 Attendees & Thought Leadership

Attendees
0 +
Speakers
0 +
Company Sponsors
0 +

Business Leaders: C-Level Executives, Project Managers, and Product Owners will get to explore best practices, methodologies, principles, and practices for achieving ROI.

Engineers, Researchers, Data Practitioners: Will get a better understanding of the challenges, solutions, and ideas being offered via breakouts & workshops on Natural Language Processing, Neural Nets, Reinforcement Learning, Generative Adversarial Networks (GANs), Evolution Strategies, AutoML, and more.

Job Seekers: Will have the opportunity to network virtually and meet over 30+ Top Al Companies.

Ignite what is an Ignite Talk?

Ignite is an innovative and fast-paced style used to deliver a concise presentation.

During an Ignite Talk, presenters discuss their research using 20 image-centric slides which automatically advance every 15 seconds.

The result is a fun and engaging five-minute presentation.

You can see all our speakers and full agenda here

Get our official conference app
For Blackberry or Windows Phone, Click here
For feature details, visit Whova