The Biggest Constraint Facing the TMLS 2026 Committee, and What It Reveals About Evals (Part 2)


Grounding works better than scoring

One of the more concrete approaches discussed involved building a structured facts library from trusted documents, and then validating agent responses against those facts rather than trying to assign an overall correctness score. The shift is from "rate the output" to "check the output against something known." This does not eliminate hallucination, but it does make factuality auditable.
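The pattern can be sketched in a few lines. This is a minimal illustration, not the approach discussed in the session: the facts library, the claim format, and all names here are hypothetical, and a real system would extract claims with an NLI or structured-extraction model rather than string splitting.

```python
# Hypothetical facts library built from trusted documents.
FACTS = {
    "refund_window_days": "30",
    "support_hours": "9am-5pm ET",
}

def extract_claims(response: str) -> dict:
    """Toy claim extractor: reads 'key=value' pairs separated by ';'.
    Stands in for a real extraction model."""
    claims = {}
    for part in response.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            claims[key.strip()] = value.strip()
    return claims

def audit(response: str) -> list[str]:
    """Return claims that contradict the facts library, instead of
    assigning the response an overall score."""
    violations = []
    for key, value in extract_claims(response).items():
        if key in FACTS and FACTS[key] != value:
            violations.append(f"{key}: said {value!r}, expected {FACTS[key]!r}")
    return violations

print(audit("refund_window_days=30; support_hours=24/7"))
# → ["support_hours: said '24/7', expected '9am-5pm ET'"]
```

The output is a list of specific violations rather than a number, which is what makes the check auditable: every failure points at a fact and a claim.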

A related approach to summarization quality moved away from overall labels and toward identifying which specific part of a summary is wrong, then guiding the model to fix that part. The observation was that most of a generated summary is usually fine, but a single sentence gets injected that misinterprets the source. An overall score misses this entirely. Zooming in on the point of failure is more useful than scoring the whole thing.
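To make the idea concrete, here is a deliberately crude sketch of localizing the failure rather than scoring the whole summary. The support score here is just word overlap with the source; a real pipeline would use an entailment model, but the control flow, i.e. score per sentence, flag the weakest, is the point.

```python
def support(sentence: str, source: str) -> float:
    """Crude per-sentence support score: fraction of the sentence's
    words that also appear in the source text."""
    words = {w.lower().strip(".,") for w in sentence.split()}
    src = {w.lower().strip(".,") for w in source.split()}
    return len(words & src) / max(len(words), 1)

def weakest_sentence(summary: str, source: str) -> str:
    """Flag the least-supported sentence for targeted repair,
    instead of labeling the whole summary good or bad."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    return min(sentences, key=lambda s: support(s, source))

source = "The outage lasted two hours and affected the billing service only."
summary = "The outage lasted two hours. It took down every service in the region."
print(weakest_sentence(summary, source))
# → "It took down every service in the region"
```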


Correlation is not explanation, and models are good at pretending

One of the strongest points in the session: explainability outputs can look convincing without being causal. Often models produce reasoning traces and feature attributions that appear to explain their behaviour, but when those features are tested through ablation (blanking out specific dimensions and checking whether the output changes) the explanation often falls apart.
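The ablation check itself is simple to state. The sketch below uses a toy linear model (synthetic weights, hypothetical names) purely to show the test: if an explanation names a feature as driving the output, blanking that feature should change the output, and if it does not, the attribution was correlational.

```python
import numpy as np

# Toy model: feature 1 deliberately has zero weight, so any
# "explanation" attributing behaviour to it should fail the ablation test.
W = np.array([2.0, 0.0, -1.5])

def model(x: np.ndarray) -> float:
    return float(W @ x)

def ablation_effect(x: np.ndarray, feature: int) -> float:
    """Absolute change in output when one feature is blanked out."""
    x_ablated = x.copy()
    x_ablated[feature] = 0.0
    return abs(model(x) - model(x_ablated))

x = np.array([1.0, 5.0, 2.0])
for i in range(3):
    print(f"feature {i}: ablation effect {ablation_effect(x, i):.2f}")
# feature 1 shows an effect of 0.00: removing it changes nothing,
# so it cannot be a causal explanation of the output.
```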

The practical takeaway is that causality testing is the only way to build real trust in interpretability. Teams using sparse autoencoders to isolate which activation dimensions drive specific agent behaviours found that contrasted pairs (similar inputs that require different tool selections) were the most effective way to learn what the model is actually relying on. This approach is promising but not yet task-agnostic and does not currently scale easily for real-time inference.

The broader lesson for any team investing in explainability: if you haven’t tested whether removing a feature changes the output, you have correlation not explanation.


Speed and efficiency metrics can mask weak systems

When testing agents across models, teams observed that some models skip steps, make odd tool calls, or go straight to an answer without engaging in the expected reasoning loop. In some cases they get the right answer anyway, by luck or by shortcuts. Measuring only latency or throughput in tokens does not surface this. A fast, confident agent can still be a brittle agent.

The same pattern applies to deflection metrics. Fewer calls reaching a human agent looks like an efficiency gain, and it might be. But if the outcome signals (was the issue resolved, did the person drop off, are they satisfied) are never captured, the metric and reality can be decoupled unknowingly. The dashboard goes green while the customer experience gets worse.

The lesson is that efficiency proxies need to be separated from outcome measures. Treating them as interchangeable is where things go wrong.
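A minimal sketch of the separation, using synthetic ticket data (the field names are hypothetical): compute the efficiency proxy and the outcome measure as distinct metrics, so one cannot silently stand in for the other.

```python
# Synthetic support tickets: deflection (proxy) and resolution (outcome)
# are tracked as separate fields, never collapsed into one number.
tickets = [
    {"deflected": True,  "resolved": False},
    {"deflected": True,  "resolved": False},
    {"deflected": True,  "resolved": True},
    {"deflected": False, "resolved": True},
]

deflection_rate = sum(t["deflected"] for t in tickets) / len(tickets)
resolution_rate = sum(t["resolved"] for t in tickets) / len(tickets)

print(f"deflection (efficiency proxy): {deflection_rate:.0%}")  # 75%, looks great
print(f"resolution (outcome measure): {resolution_rate:.0%}")   # 50%, tells another story
```

The two rates diverging is exactly the "dashboard goes green while experience gets worse" failure mode; reporting them side by side makes the divergence visible.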


When automated signals fail, the fallback is human

Multiple participants arrived at the same conclusion from different directions. When the metric cannot be trusted, the most reliable thing to do is talk to the people affected by the system’s outputs.

This means going directly to customers and asking about their experience. Another observation was that every metric has a visible component and a hidden component, and the hidden component is almost always linked to human behaviour, preferences, and context that dashboards never capture. In regulated industries, where decisions have downstream consequences for people, this hidden layer matters even more.

This is not rocket science. But the fact that experienced practitioners can discuss proxy strategies, explainability frameworks, and causal testing, and still converge on “talk to your users” says something; this is still where the real signal lives.

These are four of the ten takeaways noted in our meeting. In the next part, we will discuss:

  • AI is not magic: why setting expectations is crucial
  • Domain expertise is an evaluation method, not just an input
  • Shipping under uncertainty is possible if the scope is controlled

Why this matters for TMLS

Our TMLS conference is built for Canadian AI practitioners, researchers, and leaders. Discussions like this shape the work, and they shape the kinds of conversations that are useful in the room.

How this shaped the 2026 program

This year’s program is organized across the three TMLS content categories.

  • Technical / Engineering: Hands-on ML and GenAI implementations, agentic systems and coding
  • Business / Executive / Product Strategy: Making sure value is secured across applications
  • Fundamental Research: Cutting-edge developments, novel evaluation methods and creative approaches to agents

Would you like to be a part of TMLS 2026? You can secure your passes here. We can also be reached at info@torontomachinelearning.com

Read Part 1
