I work at the boundary between language and machine learning — designing the rubric frameworks, judge pipelines, and ground-truth benchmarks that determine whether voice and text AI agents actually work.
As a linguist, I decompose fuzzy notions like “naturalness” and “helpfulness” into scorable dimensions, then build the human-calibration and measurement pipelines that hold models to them.
Senior ML Data Linguist
ServiceNow · Santa Clara, CA
Feb 2026 — Present
- Leading the design and execution of a ground-truth evaluation pipeline for agentic conversations — constructing a human-labeled benchmark dataset and validating it against LLM-as-judge scores and automatic metrics to assess model performance on each rubric dimension.
- Defining evaluation strategy for multi-turn agentic workflows — incident creation, HR case management, tool-calling flows — across voice and text modalities in a cascade framework.
- Owning rubric standards and structured scoring criteria for response quality, instruction adherence, grounding fidelity, and conversational coherence.
- Directing annotator calibration through rubric walkthroughs, adjudication workflows, and targeted re-evaluation cycles to maintain inter-rater reliability on high-ambiguity tasks.
ML Data Linguist
ServiceNow · Santa Clara, CA
Jan 2023 — Feb 2026
- Authored LLM-as-judge prompts with structured output schemas integrated into internal evaluation tooling; wrote and published a prompt design principles guide adopted as a team-wide reference.
- Designed and maintained five rubric and evaluation schema standards used across annotation teams for multi-turn conversational AI assessment.
- Oversaw synthetic data generation pipelines to augment evaluation coverage and stress-test model behavior on underrepresented scenarios.
- Owned annotation guidelines and ran recurring calibration sessions with reviewers, establishing consistent scoring on edge cases across enterprise workflow domains.
- Cataloged systematic failure patterns — hallucination, under-specification, intent misclassification — and partnered with ML engineers to drive targeted retraining, prompt refinement, and production model selection.
Under review — ACL 2026
Beyond Naturalness: Probing Automated Text-to-Speech Evaluators on Linguistically Grounded Dimensions — designed the 10-dimension linguistically grounded schema and led the annotation of the 640-utterance dataset behind the first dimension-level meta-evaluation benchmark for TTS, used to audit MOS predictors and audio-LLM judges against human perception.
Evaluation & Metrics
- LLM-as-judge prompt design
- rubric architecture
- multi-turn conversational evaluation
- ground-truth dataset development
- conversational failure taxonomy design
- alignment assessment
Agent Systems
- tool-calling evaluation
- agentic workflow assessment
- instruction adherence scoring
- task completion analysis
- multi-step agent audit
Speech & Language
- prosody analysis
- segmental & suprasegmental evaluation
- TTS / ASR output assessment
- pragmatics
- discourse analysis
Operations & Collaboration
- annotation guideline authoring
- reviewer calibration protocols
- inter-annotator agreement (Krippendorff’s α)
- edge case adjudication
- failure impact prioritization
- cross-functional ML partnership
Technical
- Python
- structured prompt engineering
- output schema design
- YAML
- Git
- JSON schema design
2022
B.A. in Linguistics
Princeton University · Focus: Theoretical & Computational Linguistics
Thesis
Tone Sandhi and Constituency in Classifier Phrases Across Sinitic Languages
phonetics & phonology · syntax · semantics · pragmatics · Java programming · data structures & algorithms
Working on conversational AI evaluation, speech quality, or agent benchmarking? Let’s talk.