The Biggest Constraint Facing the TMLS 2026 Committee, And What It Reveals About Evals Pt. 3


Domain expertise is an evaluation method, not just an input

A recurring point in the discussion was that domain experts serve a different function in production AI than they do in traditional ML. They are not just providing labels or training data. They are the evaluation layer.

Teams described using subject matter experts for periodic manual review and calibration. The question is not just “Is the model right?” but “Would a human expert with 20 years in this domain give a different answer, and if so, why?” This reframes evaluation from a model metric problem into a human judgment problem, which is harder to automate but closer to what actually matters.

For tasks where ground truth is unstable, subjective, or context-dependent, getting expert agreement as the anchor is more reliable than trying to construct an automated score that will inevitably drift.
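As a minimal sketch of what “expert agreement as the anchor” can look like in practice, one could score the model against an expert reviewer with a chance-corrected agreement statistic such as Cohen’s kappa. The labels and the review batch below are hypothetical, not from any team in the discussion:

```python
from collections import Counter

def cohens_kappa(expert, model):
    """Agreement between expert and model labels, corrected for chance.

    `expert` and `model` are equal-length lists of categorical labels.
    """
    assert len(expert) == len(model) and expert
    n = len(expert)
    # Raw agreement: fraction of items where the two raters match.
    observed = sum(e == m for e, m in zip(expert, model)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    e_counts, m_counts = Counter(expert), Counter(model)
    labels = set(expert) | set(model)
    expected = sum((e_counts[l] / n) * (m_counts[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical review batch: "ok" / "flag" judgments from an expert vs. the model.
expert = ["ok", "ok", "flag", "ok", "flag", "ok"]
model  = ["ok", "flag", "flag", "ok", "flag", "ok"]
print(round(cohens_kappa(expert, model), 2))
```

Tracking this number across periodic review batches gives a drift signal anchored in expert judgment rather than in an automated score.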


Shipping under uncertainty is possible if the scope is controlled

The question of whether to ship and learn or block and stabilize was raised but not resolved as a general principle. What did surface was one concrete approach: start with the parts of the task that the model gets right reliably, ship those, and handle the rest through a combination of traditional methods, extraction rules, and human review. Each iteration expands what the model handles.

This is not the same as shipping blindly. It is scoping the deployment to the areas where confidence is justified and keeping humans in the loop for the rest. The key discipline is knowing which regions of the task space are safe to automate and being honest about where they end. Of course, this is still a generalization: it is subject to the conditions and constraints of each industry or vertical, as well as the inherent risk thresholds of the application area.
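One simple way to implement this kind of scoping is a confidence-based router: items the model handles reliably ship automatically, and everything else falls back to rules or human review. The function names, threshold, and toy predictor below are illustrative assumptions, not a specific team’s system:

```python
def route(items, predict, threshold=0.9):
    """Split work between automation and fallback handling by model confidence.

    `predict` returns (answer, confidence) for an item; anything below
    `threshold` goes to the review queue instead of shipping automatically.
    """
    automated, needs_review = [], []
    for item in items:
        answer, confidence = predict(item)
        if confidence >= threshold:
            automated.append((item, answer))
        else:
            needs_review.append(item)  # handled by rules or a human reviewer
    return automated, needs_review

# Toy predictor: treats longer inputs as "hard", so confidence drops.
def toy_predict(text):
    hard = len(text) >= 10
    return ("long" if hard else "short", 0.6 if hard else 0.95)

auto, review = route(["hi", "a much longer input"], toy_predict)
```

Raising the threshold shrinks the automated region; each iteration that improves the model lets the threshold cover more of the task space, which is exactly the expansion pattern described above.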


AI is not magic. Set expectations.

One point in the discussion was that executives often make decisions with equally unclear metrics and act on judgment anyway. The expectation that AI systems should produce precise, trustworthy metrics when human decision-makers do not have them either is itself a misalignment.

The practical implication: teams need to set expectations with clients and internal stakeholders that AI is a decision support tool offering information and recommendations, not an oracle. If the people consuming the output expect certainty, they will lose trust the first time the system is wrong. If they expect informed guidance under uncertainty, the same system can be useful, and trust in it is more sustainable.


When automated signals fail, the fallback is human

Multiple participants arrived at the same conclusion from different directions. When the metric cannot be trusted, the most reliable thing to do is talk to the people affected by the system’s outputs.

This means going directly to customers and asking about their experience. Another observation was that every metric has a visible component and a hidden component, and the hidden component is almost always linked to human behaviour, preferences, and context that dashboards never capture. In regulated industries, where decisions have downstream consequences for people, this hidden layer matters even more.

This is not rocket science. But the fact that experienced practitioners can discuss proxy strategies, explainability frameworks, and causal testing, and still converge on “talk to your users” says something: this is still where the real signal lives.

These are 4 of the 10 takeaways noted in our meeting, and they wrap up our three-part series drawn from our first committee discussion. If you’d like to get involved in the committee or have something to share on stage, please let us know at info@torontomachinelearning.com

Why this matters for TMLS

The TMLS conference is built for Canadian AI practitioners, researchers, and business leaders. Discussions like this shape the work, and they shape the kinds of conversations that are useful in the room.

How this shaped the 2026 program

This year’s program is organized across the three TMLS content categories.

  • Technical / Engineering: Hands-on ML and GenAI implementations, agentic systems and coding
  • Business / Executive / Product Strategy: Securing business value across applications
  • Fundamental Research: Cutting-edge developments, novel evaluation methods and creative approaches to agents

Would you like to be a part of TMLS 2026? You can secure your passes here. We can also be reached at info@torontomachinelearning.com

Read Part 1

Read Part 2

