
Evaluation and Testing were the most frequently named constraints in our 2026 TMLS Steering Committee survey.
Our first committee meetings therefore explored how we approach evals when dealing with uncertain metrics, particularly as they pertain to probabilistic agentic systems.
Here’s Part Two of what we learned.
Grounding works better than scoring
One of the more concrete approaches discussed involved building a structured facts library from trusted documents, then validating agent responses against those facts rather than trying to assign an overall correctness score. The shift is from “rate the output” to “check the output against something known.” This does not eliminate hallucination, but it does make factuality auditable.
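As a rough illustration, a grounding check of this kind can be sketched in a few lines. Everything here (the facts library, the claim tuples, the status labels) is an illustrative assumption, not a description of any participant's actual system:

```python
# Sketch: instead of scoring an agent's answer, audit each factual claim
# it makes against a trusted facts library. Claims are (entity, attribute,
# value) tuples; the library maps (entity, attribute) -> known value.

FACTS = {
    ("product_x", "release_year"): "2023",
    ("product_x", "max_users"): "500",
}

def validate(claims):
    """Return a per-claim audit: supported, contradicted, or unverifiable."""
    report = []
    for entity, attribute, value in claims:
        known = FACTS.get((entity, attribute))
        if known is None:
            status = "unverifiable"  # not in the library: flag it, don't score it
        elif known == value:
            status = "supported"
        else:
            status = "contradicted"  # an auditable failure, with the source fact
        report.append({"claim": (entity, attribute, value),
                       "status": status, "known": known})
    return report

audit = validate([
    ("product_x", "release_year", "2023"),  # matches the library
    ("product_x", "max_users", "1000"),     # contradicts the library
    ("product_x", "price", "free"),         # absent from the library
])
statuses = [row["status"] for row in audit]
```

The point of the structure is that every failure carries the source fact it conflicts with, so a reviewer can audit the check itself rather than trust a single opaque score.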
A related approach to summarization quality moved away from overall labels and toward identifying which specific part of a summary is wrong, then guiding the model to fix that part. The observation was that most of a generated summary is usually fine, but a single sentence gets injected that misinterprets the source. An overall score misses this entirely. Zooming in on the point of failure is more useful than scoring the whole thing.
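A toy sketch of that localization step, using a crude word-overlap heuristic as a stand-in for a real entailment or fact-checking model (the function name and threshold are assumptions):

```python
# Sketch: flag the specific sentence in a summary whose content is not
# supported by the source, rather than assigning one overall label.

def unsupported_sentences(source, summary, threshold=0.5):
    """Return indices of summary sentences with low support in the source."""
    source_words = set(source.lower().split())
    flagged = []
    for i, sentence in enumerate(summary.split(". ")):
        words = [w.strip(".").lower() for w in sentence.split()]
        if not words:
            continue
        overlap = sum(w in source_words for w in words) / len(words)
        if overlap < threshold:
            flagged.append(i)  # candidate for a targeted fix, not a re-write
    return flagged

source = "the meeting covered evals and agent testing in detail"
summary = "The meeting covered evals. Attendees signed binding contracts."
bad = unsupported_sentences(source, summary)
```

Here only the second sentence would be flagged, which is exactly the granularity the discussion argued for: fix that sentence, leave the rest alone.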
Correlation is not explanation, and models are good at pretending
One of the strongest points in the session: explainability outputs can look convincing without being causal. Often models produce reasoning traces and feature attributions that appear to explain their behaviour, but when those features are tested through ablation (blanking out specific dimensions and checking whether the output changes) the explanation often falls apart.
The practical takeaway is that causality testing is the only way to build real trust in interpretability. Teams using sparse autoencoders to isolate which activation dimensions drive specific agent behaviours found that contrasted pairs (similar inputs that require different tool selections) were the most effective way to learn what the model is actually relying on. This approach is promising but not yet task-agnostic and does not currently scale easily for real-time inference.
The broader lesson for any team investing in explainability: if you haven’t tested whether removing a feature changes the output, you have correlation not explanation.
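The ablation check described above can be sketched mechanically. The `model` below is a toy stand-in (a hard-coded decision rule, purely an assumption for illustration); the principle applies to any attribution you want to test causally:

```python
# Sketch: blank out one feature dimension at a time and keep only the
# dimensions whose removal actually changes the model's decision.

def model(features):
    # Toy decision rule that truly relies only on features[0] and features[2].
    return "tool_a" if features[0] + features[2] > 1.0 else "tool_b"

def causal_features(features):
    """Indices whose ablation flips the output: explanation, not correlation."""
    baseline = model(features)
    causal = []
    for i in range(len(features)):
        ablated = list(features)
        ablated[i] = 0.0                  # blank out one dimension
        if model(ablated) != baseline:    # output changes -> causally relevant
            causal.append(i)
    return causal

# Feature 1 may correlate with the output, but ablation shows it is not causal.
result = causal_features([0.8, 0.9, 0.6])
```

Running this yields `[0, 2]`: dimension 1 could dominate a correlational attribution and still contribute nothing, which is the failure mode the session warned about.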
Speed and efficiency metrics can mask weak systems
When testing agents across models, teams observed that some models skip steps, make odd tool calls, or go straight to an answer without engaging in the expected reasoning loop. In some cases they get the right answer anyway, by luck or by shortcuts. Measuring only latency or throughput in tokens does not surface this. A fast, confident agent can still be a brittle agent.
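One hedged way to surface this is to check the trace itself, not just the clock. The step names below are illustrative assumptions about what an "expected reasoning loop" contains:

```python
# Sketch: a fast agent that skipped the expected reasoning loop fails this
# check even if its final answer happens to be right.

EXPECTED_LOOP = ["retrieve", "reason", "verify", "answer"]

def loop_followed(trace):
    """True only if the expected steps occur, in order, within the trace."""
    it = iter(trace)
    # `step in it` consumes the iterator, so this is an ordered-subsequence check.
    return all(step in it for step in EXPECTED_LOOP)

ok = loop_followed(["retrieve", "reason", "tool_call", "verify", "answer"])
shortcut = loop_followed(["answer"])  # right answer, skipped the loop
```

Reported alongside latency, a check like this separates "fast because efficient" from "fast because it skipped the work."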
The same pattern applies to deflection metrics. Fewer calls reaching a human agent looks like an efficiency gain, and it might be. But if the outcome signals (was the issue resolved, did the person drop off, are they satisfied) are never captured, the metric and reality can be decoupled unknowingly. The dashboard goes green while the customer experience gets worse.
The lesson is that efficiency proxies need to be separated from outcome measures. Treating them as interchangeable is where things go wrong.
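Keeping the two apart can be as simple as reporting them side by side and never collapsing them into one number. The ticket fields below are illustrative assumptions:

```python
# Sketch: compute the efficiency proxy (deflection rate) and the outcome
# measure (resolution rate) separately, from the same tickets.

tickets = [
    {"deflected": True,  "resolved": False},  # green on the dashboard, bad outcome
    {"deflected": True,  "resolved": True},
    {"deflected": False, "resolved": True},
    {"deflected": True,  "resolved": False},
]

deflection_rate = sum(t["deflected"] for t in tickets) / len(tickets)
resolution_rate = sum(t["resolved"] for t in tickets) / len(tickets)

# Here deflection is 0.75 while resolution is only 0.5: exactly the
# decoupling the section warns about. Report both, never just the proxy.
```

A high deflection rate paired with a falling resolution rate is the early warning that the proxy has detached from the outcome.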
When automated signals fail, the fallback is human
Multiple participants arrived at the same conclusion from different directions. When the metric cannot be trusted, the most reliable thing to do is talk to the people affected by the system’s outputs.
This means going directly to customers and asking about their experience. Another observation was that every metric has a visible component and a hidden component, and the hidden component is almost always linked to human behaviour, preferences, and context that dashboards never capture. In regulated industries, where decisions have downstream consequences for people, this hidden layer matters even more.
This is not rocket science. But the fact that experienced practitioners can discuss proxy strategies, explainability frameworks, and causal testing, and still converge on “talk to your users” says something: this is still where the real signal lives.
These are 4 of the 10 takeaways noted in our meeting. In the next part, we will discuss:
- AI is not magic: why setting expectations is crucial
- Domain expertise is an evaluation method, not just an input
- Shipping under uncertainty is possible if the scope is controlled
Why this matters for TMLS
Our TMLS conference is built for Canadian AI practitioners, researchers, and leaders. Discussions like this shape the work, and they shape the kinds of conversations that are useful in the room.
How this shaped the 2026 program
This year’s program is organized across the three TMLS content categories.
- Technical / Engineering: Hands-on ML and GenAI implementations, agentic systems and coding
- Business / Executive / Product Strategy: Making sure value is secured across applications
- Fundamental Research: Cutting-edge developments, novel evaluation methods and creative approaches to agents
Would you like to be a part of TMLS 2026? You can secure your passes here. We can also be reached at info@torontomachinelearning.com