
Evaluation and testing were the most frequently named constraints in our 2026 TMLS steering committee survey.
We asked the question and The clearest signal was “evaluations”.
So our first committee meetings explored how we approach evals, specifically when dealing with uncertain metrics (specifically as they pertain to probabilistic agentic systems)
Metrics aren’t wrong. They’re incomplete.
Across every domain discussed, similar patterns arose. Teams are not working with metrics that give them bad information but metrics that give them partial information, and the missing part is usually the part that matters for the decision.
Fraud detection metrics exist but the ground truth on what is not fraud arrives too late to be operationally useful. Human-in-the-loop metrics capture how often a human overrides the model, but not whether that override was actually better. GPU utilization shows allocation, not productive use. Call deflection shows that fewer interactions reach a human, but not whether the customer’s issue was resolved. Recall looks strong on paper but does not reliably describe what is happening downstream.
Lesson from Meeting? The consistent lesson is that production metrics tend to measure something adjacent to the actual outcome. They measure activity, not effect. The gap between the two is where bad decisions happen.
Hallucination and subjective tasks resist stable measurement
Hallucination rates are a moving target. There is no stable definition that holds across teams, domains, or time. The metric shifts with the task, and attempts to pin it down tend to produce numbers that look precise but aren’t reliable. One observation raised in the session compared this to the Heisenberg uncertainty principle: the act of trying to observe a dynamic system too closely changes the system’s behaviour, particularly in reasoning traces.
The same problem applies to any task that involves subjective judgment. When the correct answer depends on tone, inflection, interpretation, or context, even human evaluators cannot agree. People try to objectify these assessments into quantified scores, but the result is a metric that gives the appearance of rigor, but still lacks substance. This is not a gap that better tooling will close. It is a property of the task.
Proxy strategies are the operating reality, not a fallback
When the ideal metric is not available, teams do not stop making decisions. They find something else to lean on. The more useful question is not “do you have the right metric” but “do you know what your proxy is actually measuring, and what it is missing?”
The proxy strategies that held up best in the discussion shared a few properties. They were deliberate, not accidental. They had known limits. And they were tied to a feedback loop that continuously updated the proxy over time.
Golden datasets combined with adversarial stress tests (prompt injection, corrupted inputs, edge cases) were the most commonly referenced approach. These are not perfect, and they go stale, but they provide a stable reference point when live metrics are noisy. The important thing is that failures from these tests get looped back into training data, RAG pipelines, and data strategy, closing the loop rather than just flagging problems.
Maintaining a rules-based or decision-tree baseline was raised by multiple participants. If the model cannot beat a simple baseline, that is a signal worth paying attention to. This prevents the common failure; using a fancy model that is actually worse than what came before.For tasks where multiple valid answers exist, teams are moving toward soft correctness: hierarchical scoring and degrees of right, rather than binary pass/fail. This is especially relevant for classification at different levels of a hierarchy, where several answers can be technically correct but at different levels of specificity. Binary evaluation on non-binary tasks produces misleading results.
These are 3 of the 10 takeaways noted in our meeting. In the next part, we will discuss;
– How Grounding Works better than scoring
– Correlation is not explanation, and models are good at pretending
– Speed and efficiency metrics can mask weak systems
Why this matters for TMLS
Our TMLS conference is built for Canadian AI practitioners, researchers, and leaders working in AI. Discussions like this shape the work, and they shape the kinds of conversations that are useful in the room.
How this shaped the 2026 program
This year’s program is organized across the three TMLS content categories.
- Technical / Engineering: hands-on ML and GenAI implementations, agentic systems and coding
- Business / Executive / Product Strategy: Making sure value is secured across applications,
- Fundamental Research: Cutting-edge developments, novel evaluation methods and creative approaches to agents
This wraps up part 1 of our discussions on evals in the face of uncertain metrics. If you’d like to get involved in the committee or have something to share on stage, please let us know at info@torontomachinelearning.com