Simon Rosen — Linguist & Evaluation Engineer

I work at the boundary between language and machine learning — designing the rubric frameworks, judge pipelines, and ground-truth benchmarks that determine whether voice and text AI agents actually work.

As a linguist, I decompose fuzzy notions like “naturalness” and “helpfulness” into scorable dimensions, then build the human-calibration and measurement pipelines that hold models to them.

Senior ML Data Linguist

ServiceNow · Santa Clara, CA

Feb 2026 — Present

Leading the design and execution of a ground-truth evaluation pipeline for agentic conversations: constructing a human-labeled benchmark dataset and validating it against LLM-as-judge and automatic metrics to assess model performance on each scored dimension.
Defining evaluation strategy for multi-turn agentic workflows — incident creation, HR case management, tool-calling flows — across voice and text modalities in a cascade framework.
Owning rubric standards and structured scoring criteria for response quality, instruction adherence, grounding fidelity, and conversational coherence.
Directing annotator calibration through rubric walkthroughs, adjudication workflows, and targeted re-evaluation cycles to maintain inter-rater reliability on high-ambiguity tasks.

ML Data Linguist

ServiceNow · Santa Clara, CA

Jan 2023 — Feb 2026

Authored LLM-as-judge prompts with structured output schemas integrated into internal evaluation tooling; wrote and published a prompt design principles guide adopted as a team-wide reference.
Designed and maintained five rubric and evaluation schema standards used across annotation teams for multi-turn conversational AI assessment.
Oversaw synthetic data generation pipelines to augment evaluation coverage and stress-test model behavior on underrepresented scenarios.
Owned annotation guidelines and ran recurring calibration sessions with reviewers, establishing consistent scoring on edge cases across enterprise workflow domains.
Cataloged systematic failure patterns — hallucination, under-specification, intent misclassification — and partnered with ML engineers to drive targeted retraining, prompt refinement, and production model selection.

Under review · ACL 2026

Beyond Naturalness: Probing Automated Text-to-Speech Evaluators on Linguistically Grounded Dimensions

Designed the 10-dimension linguistically grounded schema and led the annotation of the 640-utterance dataset behind the first dimension-level meta-evaluation benchmark for TTS, used to audit MOS predictors and audio-LLM judges against human perception.

Evaluation & Metrics

LLM-as-judge prompt design
rubric architecture
multi-turn conversational evaluation
ground-truth dataset development
conversational failure taxonomy design
alignment assessment

Agent Systems

tool-calling evaluation
agentic workflow assessment
instruction adherence scoring
task completion analysis
multi-step agent audit

Speech & Language

prosody analysis
segmental & suprasegmental evaluation
TTS / ASR output assessment
pragmatics
discourse analysis

Operations & Collaboration

annotation guideline authoring
reviewer calibration protocols
inter-annotator agreement (Krippendorff’s α)
edge case adjudication
failure impact prioritization
cross-functional ML partnership

Technical

Python
structured prompt engineering
output schema design
YAML
Git
JSON schema design

2022

B.A. in Linguistics

Princeton University · Focus: Theoretical & Computational Linguistics

Thesis: Tone Sandhi and Constituency in Classifier Phrases Across Sinitic Languages

phonetics & phonology · syntax · semantics · pragmatics · Java programming · data structures & algorithms

Working on conversational AI evaluation, speech quality, or agent benchmarking? Let’s talk.

Email me · LinkedIn