The Biggest Constraint Facing the TMLS 2026 Committee, and What It Reveals About Evals (Pt. 1.)

Evaluation and testing were the most frequently named constraints in our 2026 TMLS steering committee survey.

We asked the question and The clearest signal was “evaluations”.

So our first committee meetings explored how we approach evals, specifically when dealing with uncertain metrics (specifically as they pertain to probabilistic agentic systems)

Metrics aren’t wrong. They’re incomplete.

Across every domain discussed, similar patterns arose. Teams are not working with metrics that give them bad information but metrics that give them partial information, and the missing part is usually the part that matters for the decision.

Fraud detection metrics exist but the ground truth on what is not fraud arrives too late to be operationally useful. Human-in-the-loop metrics capture how often a human overrides the model, but not whether that override was actually better. GPU utilization shows allocation, not productive use. Call deflection shows that fewer interactions reach a human, but not whether the customer’s issue was resolved. Recall looks strong on paper but does not reliably describe what is happening downstream.

Lesson from Meeting? The consistent lesson is that production metrics tend to measure something adjacent to the actual outcome. They measure activity, not effect. The gap between the two is where bad decisions happen.

Hallucination and subjective tasks resist stable measurement

Hallucination rates are a moving target. There is no stable definition that holds across teams, domains, or time. The metric shifts with the task, and attempts to pin it down tend to produce numbers that look precise but aren’t reliable. One observation raised in the session compared this to the Heisenberg uncertainty principle: the act of trying to observe a dynamic system too closely changes the system’s behaviour, particularly in reasoning traces.

The same problem applies to any task that involves subjective judgment. When the correct answer depends on tone, inflection, interpretation, or context, even human evaluators cannot agree. People try to objectify these assessments into quantified scores, but the result is a metric that gives the appearance of rigor, but still lacks substance. This is not a gap that better tooling will close. It is a property of the task.

Proxy strategies are the operating reality, not a fallback

When the ideal metric is not available, teams do not stop making decisions. They find something else to lean on. The more useful question is not “do you have the right metric” but “do you know what your proxy is actually measuring, and what it is missing?”

The proxy strategies that held up best in the discussion shared a few properties. They were deliberate, not accidental. They had known limits. And they were tied to a feedback loop that continuously updated the proxy over time.

Golden datasets combined with adversarial stress tests (prompt injection, corrupted inputs, edge cases) were the most commonly referenced approach. These are not perfect, and they go stale, but they provide a stable reference point when live metrics are noisy. The important thing is that failures from these tests get looped back into training data, RAG pipelines, and data strategy, closing the loop rather than just flagging problems.

Maintaining a rules-based or decision-tree baseline was raised by multiple participants. If the model cannot beat a simple baseline, that is a signal worth paying attention to. This prevents the common failure; using a fancy model that is actually worse than what came before.For tasks where multiple valid answers exist, teams are moving toward soft correctness: hierarchical scoring and degrees of right, rather than binary pass/fail. This is especially relevant for classification at different levels of a hierarchy, where several answers can be technically correct but at different levels of specificity. Binary evaluation on non-binary tasks produces misleading results.

These are 3 of the 10 takeaways noted in our meeting. In the next part, we will discuss;

– How Grounding Works better than scoring
– Correlation is not explanation, and models are good at pretending
– Speed and efficiency metrics can mask weak systems

Why this matters for TMLS

Our TMLS conference is built for Canadian AI practitioners, researchers, and leaders working in AI. Discussions like this shape the work, and they shape the kinds of conversations that are useful in the room.

How this shaped the 2026 program

This year’s program is organized across the three TMLS content categories.

Technical / Engineering: hands-on ML and GenAI implementations, agentic systems and coding
Business / Executive / Product Strategy: Making sure value is secured across applications,
Fundamental Research: Cutting-edge developments, novel evaluation methods and creative approaches to agents

This wraps up part 1 of our discussions on evals in the face of uncertain metrics. If you’d like to get involved in the committee or have something to share on stage, please let us know at info@torontomachinelearning.com

Toronto Machine Learning Summit

June 16th - 19th

The Biggest Constraint Facing the TMLS 2026 Committee, and What It Reveals About Evals (Pt. 1.)

Metrics aren’t wrong. They’re incomplete.

Hallucination and subjective tasks resist stable measurement

Proxy strategies are the operating reality, not a fallback

Why this matters for TMLS

How this shaped the 2026 program

TMLS

9th Annual:

TMLS

Stay up to date with all social invites and news for TMLS 2025

The Biggest Constraint Facing the TMLS 2026 Committee, and What It Reveals About Evals (Pt. 1.)

Metrics aren’t wrong. They’re incomplete.

Hallucination and subjective tasks resist stable measurement

Proxy strategies are the operating reality, not a fallback

Why this matters for TMLS

How this shaped the 2026 program

TMLS

Who Attends

2023 Event Demographics

2023 Technical Background

2023 Attendees & Thought Leadership