The Biggest Constraint Facing the TMLS 2026 Committee, And What It Reveals About Evals Pt. 3


Domain expertise is an evaluation method, not just an input

A recurring point in the discussion was that domain experts serve a different function in production AI than they do in traditional ML. They are not just providing labels or training data. They are the evaluation layer.

Teams described using subject matter experts for periodic manual review and calibration. The question is not just “Is the model right?” but “Would a human expert with 20 years in this domain give a different answer, and if so, why?” This reframes evaluation from a model metric problem into a human judgment problem, which is harder to automate but closer to what actually matters.

For tasks where ground truth is unstable, subjective, or context-dependent, getting expert agreement as the anchor is more reliable than trying to construct an automated score that will inevitably drift.
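As a minimal sketch of what “expert agreement as the anchor” can look like in practice, one could score the model against an expert reviewer with a chance-corrected agreement statistic such as Cohen’s kappa. The labels and the review batch below are hypothetical, not from any team in the discussion:

```python
from collections import Counter

def cohens_kappa(expert, model):
    """Agreement between expert and model labels, corrected for chance.

    `expert` and `model` are equal-length lists of categorical labels.
    """
    assert len(expert) == len(model) and expert
    n = len(expert)
    # Raw agreement: fraction of items where the two raters match.
    observed = sum(e == m for e, m in zip(expert, model)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    e_counts, m_counts = Counter(expert), Counter(model)
    labels = set(expert) | set(model)
    expected = sum((e_counts[l] / n) * (m_counts[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical review batch: "ok" / "flag" judgments from an expert vs. the model.
expert = ["ok", "ok", "flag", "ok", "flag", "ok"]
model  = ["ok", "flag", "flag", "ok", "flag", "ok"]
print(round(cohens_kappa(expert, model), 2))
```

Tracking this number across periodic review batches gives a drift signal anchored in expert judgment rather than in an automated score.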


Shipping under uncertainty is possible if the scope is controlled

The question of whether to ship and learn or block and stabilize was raised but not resolved as a general principle. What did surface was one concrete approach: start with the parts of the task that the model gets right reliably, ship those, and handle the rest through a combination of traditional methods, extraction rules, and human review. Each iteration expands what the model handles.

This is not the same as shipping blindly. It is scoping the deployment to the areas where confidence is justified and keeping humans in the loop for the rest. The key discipline is knowing which regions of the task space are safe to automate and being honest about where they end. Of course, this is still a generalization: it is subject to the conditions and constraints of each industry or vertical, as well as the inherent risk thresholds of the application area.
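One simple way to implement this kind of scoping is a confidence-based router: items the model handles reliably ship automatically, and everything else falls back to rules or human review. The function names, threshold, and toy predictor below are illustrative assumptions, not a specific team’s system:

```python
def route(items, predict, threshold=0.9):
    """Split work between automation and fallback handling by model confidence.

    `predict` returns (answer, confidence) for an item; anything below
    `threshold` goes to the review queue instead of shipping automatically.
    """
    automated, needs_review = [], []
    for item in items:
        answer, confidence = predict(item)
        if confidence >= threshold:
            automated.append((item, answer))
        else:
            needs_review.append(item)  # handled by rules or a human reviewer
    return automated, needs_review

# Toy predictor: treats longer inputs as "hard", so confidence drops.
def toy_predict(text):
    hard = len(text) >= 10
    return ("long" if hard else "short", 0.6 if hard else 0.95)

auto, review = route(["hi", "a much longer input"], toy_predict)
```

Raising the threshold shrinks the automated region; each iteration that improves the model lets the threshold cover more of the task space, which is exactly the expansion pattern described above.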


AI is not magic. Set expectations.

One point in the discussion was that executives often make decisions with equally unclear metrics and act on judgment anyway. The expectation that AI systems should produce precise, trustworthy metrics when human decision-makers do not have them either is itself a misalignment.

The practical implication: teams need to set expectations with clients and internal stakeholders that AI is a decision support tool offering information and recommendations, not an oracle. If the people consuming the output expect certainty, they will lose trust the first time the system is wrong. If they expect informed guidance under uncertainty, the same system can be useful, and trust in it is more sustainable.


When automated signals fail, the fallback is human

Multiple participants arrived at the same conclusion from different directions. When the metric cannot be trusted, the most reliable thing to do is talk to the people affected by the system’s outputs.

This means going directly to customers and asking about their experience. Another observation was that every metric has a visible component and a hidden component, and the hidden component is almost always linked to human behaviour, preferences, and context that dashboards never capture. In regulated industries, where decisions have downstream consequences for people, this hidden layer matters even more.

This is not rocket science. But the fact that experienced practitioners can discuss proxy strategies, explainability frameworks, and causal testing, and still converge on “talk to your users” says something: this is still where the real signal lives.

These are 4 of the 10 takeaways noted in our meeting, and they wrap up our three-part series drawn from our first committee discussion. If you’d like to get involved in the committee or have something to share on stage, please let us know at info@torontomachinelearning.com

Why this matters for TMLS

The TMLS conference is built for Canadian AI practitioners, researchers, and business leaders. Discussions like this shape the work, and they shape the kinds of conversations that are useful in the room.

How this shaped the 2026 program

This year’s program is organized across the three TMLS content categories.

  • Technical / Engineering: Hands-on ML and GenAI implementations, agentic systems and coding
  • Business / Executive / Product Strategy: Securing business value across applications
  • Fundamental Research: Cutting-edge developments, novel evaluation methods and creative approaches to agents

Would you like to be a part of TMLS 2026? You can secure your passes here. We can also be reached at info@torontomachinelearning.com

Read Part 1

Read Part 2

