Hanieh Arjmand

Lead AI Engineer,

Chubb

ABOUT THE SPEAKER:

Hannah Arjmand is a Lead AI Engineer with a Ph.D. in Biomedical Engineering from the University of Toronto. She leads the development of LLM systems in regulated industries, with a focus on post-training and evaluation. Her track record spans healthcare AI and enterprise insurance applications, and includes a filed patent in multimodal AI and peer-reviewed publications. Hannah is a regular presenter at applied AI conferences.

TALK TITLE:

A Multi-Stage Framework for Instruction-Based Evaluation of LLM Outputs

TRACK:

Technical / Engineering Talks

SUB TOPIC:

Evaluation Methods & Capability Benchmarking

ABSTRACT:

Deploying large language models (LLMs) in regulated decision-support settings presents a unique evaluation challenge: models follow explicit, multi-step instructions that produce domain-specific outputs, not simple classifications. Standard benchmarks do not capture whether a model correctly applies prescribed reasoning steps, produces recommendations from task-specific taxonomies, or maintains consistency with established decision criteria. When transitioning between production models, evaluation must assess both correctness against ground truth and relative output quality, often with limited labeled data.

We propose a multi-stage evaluation framework developed during a production model transition from Model A, a dense model, to Model B. Our system generates structured assessments across multiple task domains, where each decision task has distinct instruction sets, output formats, and recommendation categories.

The framework addresses three core problems. First, instruction-aware label extraction: since model outputs are free-text narratives rather than structured labels, we use a secondary LLM classifier with task-specific prompts derived from the same instructions given to the primary model, ensuring extracted labels align with the intended recommendation taxonomy. We show that naive mappings misrepresent model accuracy and that aligning extraction categories to prompt instructions improved measurement fidelity. Second, complementary evaluation under label scarcity: we combine offline accuracy on expert-labeled data with pairwise LLM-as-judge comparisons on unlabeled production data, providing both absolute and relative quality signals. Third, training data evolution during model transitions: each model interprets instructions through its own learned style, structuring outputs differently, emphasizing different aspects of the prompt, and producing distinct narrative patterns even when given identical instructions. When the new model’s outputs are used to generate training data for future iterations, these stylistic differences propagate into the ground truth. Annotators reviewing outputs must recalibrate to the new model’s conventions, and labels created under Model A may not transfer cleanly to Model B. We found that switching models requires regenerating outputs for annotator review and updating training data to reflect the new model’s instruction-following behavior, rather than assuming compatibility with existing annotations.
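The first two ideas above can be illustrated with a minimal Python sketch. It is not the implementation used in the talk: the function names, the prompt wording, the `UNMAPPED` sentinel, and the `classify` callable (a stand-in for whatever secondary LLM does the extraction) are all assumptions made for illustration. The point it shows is that the extraction prompt reuses the primary model's own task instructions and category list, and that anything outside that taxonomy is flagged rather than forced into a class.

```python
from typing import Callable, Sequence

UNMAPPED = "UNMAPPED"  # hypothetical sentinel for answers outside the taxonomy


def build_extraction_prompt(
    instructions: str, categories: Sequence[str], narrative: str
) -> str:
    """Build a classifier prompt that reuses the primary model's instructions."""
    options = "\n".join(f"- {c}" for c in categories)
    return (
        "The following instructions were given to the primary model:\n"
        f"{instructions}\n\n"
        "Which one of these recommendation categories does the output express?\n"
        f"{options}\n\n"
        f"Output:\n{narrative}\n\n"
        "Answer with the category name only."
    )


def extract_label(
    narrative: str,
    instructions: str,
    categories: Sequence[str],
    classify: Callable[[str], str],
) -> str:
    """Map a free-text narrative to one task category, or UNMAPPED."""
    raw = classify(build_extraction_prompt(instructions, categories, narrative))
    normalized = raw.strip().lower()
    for category in categories:
        if category.lower() == normalized:
            return category
    # Do not guess: an answer outside the taxonomy is reported as such.
    return UNMAPPED


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Offline accuracy of extracted labels against expert labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("no labeled samples")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

Keeping `UNMAPPED` as a separate outcome is a design choice in this sketch: a rising share of unmapped outputs suggests the extraction categories have drifted from the prompt instructions, which is the kind of mismatch the abstract says distorts measured accuracy.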

We identify several limitations of LLM-as-judge evaluation that practitioners should account for. The judge exhibited verbosity bias, preferring longer, more detailed outputs regardless of correctness, which risks rewarding over-generation over precision. The judge also showed limited domain calibration: it could identify structural and stylistic differences between outputs but struggled to assess whether a specific recommendation was appropriate given the underlying data, a judgment that requires domain expertise the judge model lacks. Finally, the judge’s quality preferences did not always align with recommendation accuracy. In one task domain, the judge preferred Model B’s outputs 64.3% of the time, yet Model A had higher accuracy on the overall recommendation task, highlighting that perceived quality and decision correctness are distinct dimensions that require separate measurement.
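One way to make these caveats measurable is sketched below in Python. The `Comparison` record, the function names, and the use of character length as a proxy for verbosity are assumptions for illustration, not the method described in the talk. The sketch computes the judge's win rate for one model and, separately, how often the judge picked the longer of the two outputs.

```python
from dataclasses import dataclass
from typing import Iterable, Literal


@dataclass(frozen=True)
class Comparison:
    """One pairwise judge verdict (hypothetical record layout)."""
    output_a: str
    output_b: str
    winner: Literal["A", "B", "tie"]


def win_rate(comparisons: Iterable[Comparison], model: str = "B") -> float:
    """Share of non-tie verdicts won by the given model."""
    decided = [c for c in comparisons if c.winner != "tie"]
    if not decided:
        raise ValueError("no decided comparisons")
    return sum(c.winner == model for c in decided) / len(decided)


def longer_output_preference(comparisons: Iterable[Comparison]) -> float:
    """Share of non-tie verdicts in which the judge chose the longer output.

    Length is measured in characters, which is an assumption of this sketch.
    Pairs with equal-length outputs are skipped.
    """
    relevant = [
        c for c in comparisons
        if c.winner != "tie" and len(c.output_a) != len(c.output_b)
    ]
    if not relevant:
        raise ValueError("no comparisons with differing lengths")

    def longer(c: Comparison) -> str:
        return "A" if len(c.output_a) > len(c.output_b) else "B"

    return sum(c.winner == longer(c) for c in relevant) / len(relevant)
```

A longer-output preference well above 0.5 is consistent with verbosity bias but does not prove it, since longer outputs may also be better. Reporting `win_rate` next to labeled accuracy for the same task keeps perceived quality and decision correctness as separate measurements.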

Across several hundred labeled samples spanning multiple decision types, our framework revealed performance differences obscured by earlier approaches, including a failure mode where one model defaulted to a single prediction class on 96% of inputs for one task, visible only after correcting the label taxonomy. We discuss implications for practitioners evaluating LLMs in instruction-heavy, domain-specific production settings where ground truth is scarce and automated judges are imperfect proxies for expert assessment.
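A failure mode like the single-class default can be surfaced with a simple distribution check on the extracted labels. The Python sketch below is illustrative only: the function names and the 0.9 threshold are assumptions, not values from the talk.

```python
from collections import Counter
from typing import Iterable, Tuple


def dominant_class_share(labels: Iterable[str]) -> Tuple[str, float]:
    """Return the most frequent label and its share of all predictions."""
    items = list(labels)
    if not items:
        raise ValueError("no labels")
    label, count = Counter(items).most_common(1)[0]
    return label, count / len(items)


def is_collapsed(labels: Iterable[str], threshold: float = 0.9) -> bool:
    """Flag a task whose predictions concentrate on one class.

    The default threshold is a hypothetical choice for this sketch.
    """
    _, share = dominant_class_share(labels)
    return share >= threshold
```

A high dominant share is only a warning sign when the expert labels for the same task are more evenly spread, so the check is most useful when run on both the predicted and the gold label distributions.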

WHAT YOU’LL LEARN:

TBA
