AutoMetrics: Approximate Human Judgments
with Automatically Generated Evaluators

ICLR 2026

Michael J. Ryan^♠, Yanzhe Zhang^♠, Amol Salunkhe^♣, Yi Chu^♣, Di Xu^♣, Diyi Yang^♠

^♠Stanford University ^♣American Express


AutoMetrics example: product descriptions with human feedback on the left, top five AutoMetrics on the right. — AutoMetrics takes you from expensive measures to interpretable automatic metrics. Here AutoMetrics generates useful metrics for evaluating LLM-written product descriptions from user reviews (EvalGen, Shankar et al., 2024). Percentages indicate the relative importance of each metric derived from regression coefficients.

Try AutoMetrics on your own data

Upload a small CSV of prompts, outputs, and human scores. We'll generate, select, and aggregate the metrics — no setup required.


What is AutoMetrics?

Human feedback is the gold standard for evaluating AI systems on subjective tasks, and the hardest thing to collect. Reward models want thousands of labels. Rubric LLM-judges want someone to write the rubric. A sprinkle of thumbs-up/down tells you nothing about what users actually value.

AutoMetrics meets you where you are. Hand it a task description and fewer than 100 feedback points. It proposes candidate criteria with LLM judges, pulls complementary metrics from a curated bank of 48, and fits a PLS regression against your labels. You get back an interpretable linear combination of named metrics — one that correlates with human ratings up to 33.4% better than LLM-as-a-Judge baselines and, in our τ-Bench case study, stands in for a verifiable reward during agent optimization.

+33.4%

Kendall τ

Improvement over the closest LLM-as-a-Judge baseline on EvalGen — the largest out-of-distribution gain among the five evaluated tasks.

~80

labels suffice

Across SimpEval, HelpSteer2, and RealHumanEval, correlation with human judgments plateaus around 80 training examples.

τ-Bench

matches verifiable reward

Used as an optimization signal for a ReAct agent, AutoMetrics matches the accuracy gains from optimizing on a verifiable reward (0.72 vs. 0.68 after 2000 rollouts).

The AutoMetrics Method

Four-step AutoMetrics pipeline: Generate, Retrieve, Regress, Report. — AutoMetrics comprises four steps. (1) Generate: create task-specific candidate metrics (single criterion, rubric, example-based, MIPROv2-optimized). (2) Retrieve: from the generated candidates plus MetricBank, use ColBERT to prefilter to k′ metric cards and an LLM to select the final k. (3) Regress: fit a PLS model on the training set to weight and select metrics that predict human judgments. (4) Report: produce a writeup with weights, correlations, and details to guide adoption.

1. Generate

The default configuration produces 10 single-criterion, 5 rubric, 1 example-based, and 1 prompt-optimized LLM-as-a-Judge metric per run — validated across nearly 30 settings. Each generated metric ships with a Metric Card documenting description, intended use, implementation, and limitations.
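The default generation budget and the four Metric Card sections can be pictured as plain data structures. This is an illustrative sketch only: `MetricCard`, `GenerationBudget`, and their field names are hypothetical and are not taken from the AutoMetrics package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricCard:
    """Hypothetical container for the four sections a Metric Card documents."""
    name: str
    description: str
    intended_use: str
    implementation: str
    limitations: str


@dataclass(frozen=True)
class GenerationBudget:
    """Number of generated metrics per type; defaults follow the text above."""
    single_criterion: int = 10
    rubric: int = 5
    example_based: int = 1
    prompt_optimized: int = 1

    def total(self) -> int:
        return (self.single_criterion + self.rubric
                + self.example_based + self.prompt_optimized)


default_budget = GenerationBudget()
# Assumption: the larger budget of the tau-Bench case study (10 -> 20, 5 -> 8)
# maps onto the single-criterion and rubric counts.
case_study_budget = GenerationBudget(single_criterion=20, rubric=8)
print(default_budget.total(), case_study_budget.total())
```

Keeping the budget in one immutable object makes it easy to record which configuration produced a given set of metrics.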

2. Retrieve

Alongside generated metrics, AutoMetrics draws from MetricBank, a curated library of 48 metrics from the NLP literature, each with its own Metric Card. A hybrid ColBERT + LLM retriever uses the task description and target as the query and narrows down the candidate pool.
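The two-stage shape of the retriever (cheap prefilter to k′ cards, then a more expensive selector down to k) can be sketched as follows. This is not the AutoMetrics implementation: a bag-of-words cosine similarity stands in for ColBERT, the `rerank` callable stands in for the LLM selector, and every name and card text here is made up for illustration.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a crude stand-in for ColBERT scoring)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, cards: Dict[str, str], k_prime: int, k: int,
             rerank: Callable[[str, List[str]], List[str]]) -> List[str]:
    """Prefilter to k_prime cards by similarity, then let `rerank` choose k."""
    ranked = sorted(cards, key=lambda name: bow_cosine(query, cards[name]),
                    reverse=True)
    shortlist = ranked[:k_prime]
    return rerank(query, shortlist)[:k]


# Toy metric cards; the one-line texts are invented for this example.
cards = {
    "BLEU": "n-gram overlap with a reference translation",
    "FKGL": "readability grade level of a sentence",
    "Length": "number of tokens in the output",
}
# Placeholder reranker that keeps the prefilter order; an LLM call would go here.
keep_order = lambda query, names: names
selected = retrieve("readability of a simplified sentence", cards,
                    k_prime=2, k=1, rerank=keep_order)
print(selected)
```

Passing the reranker in as a callable keeps the prefilter testable without any model calls.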

3. Regress

All metric scores are z-normalized and fit with Partial Least Squares regression. PLS handles the regime where predictors may outnumber observations and are correlated. A two-stage fit selects the top n metrics, then refits on that reduced set. Negatively correlated generated metrics are dropped (existing metrics are preserved — length can legitimately anti-correlate with conciseness).
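A minimal sketch of the Regress step, assuming a single PLS component and pure-Python arithmetic. The number of components and the library AutoMetrics actually uses are not stated here, and the function names and toy data are hypothetical.

```python
import math
from typing import Dict, List


def zscore(values: List[float]) -> List[float]:
    """Center to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std else 0.0 for v in values]


def pls1_one_component(columns: Dict[str, List[float]],
                       y: List[float]) -> Dict[str, float]:
    """One-component PLS1 coefficients on z-scored metrics and z-scored labels."""
    names = list(columns)
    X = {n: zscore(columns[n]) for n in names}
    yz = zscore(y)
    # Weight vector w is proportional to X^T y (covariance of each metric with y).
    w = {n: sum(x * t for x, t in zip(X[n], yz)) for n in names}
    norm = math.sqrt(sum(v * v for v in w.values()))
    if norm == 0:
        return {n: 0.0 for n in names}
    w = {n: v / norm for n, v in w.items()}
    # Latent score t = X w, then a one-dimensional regression of y on t.
    t = [sum(w[n] * X[n][i] for n in names) for i in range(len(yz))]
    q = sum(a * b for a, b in zip(t, yz)) / sum(a * a for a in t)
    return {n: w[n] * q for n in names}


def two_stage_fit(columns: Dict[str, List[float]], y: List[float],
                  top_n: int) -> Dict[str, float]:
    """Fit on all metrics, keep the top_n by |coefficient|, then refit."""
    first = pls1_one_component(columns, y)
    keep = sorted(first, key=lambda n: abs(first[n]), reverse=True)[:top_n]
    return pls1_one_component({n: columns[n] for n in keep}, y)


# Toy data: three metric columns and six human scores.
metrics = {
    "clarity": [1, 2, 3, 4, 5, 6],
    "fluency": [1, 3, 2, 4, 6, 5],
    "noise": [2, 1, 2, 1, 2, 1],
}
human = [1, 2, 3, 4, 5, 6]
weights = two_stage_fit(metrics, human, top_n=2)
print(weights)
```

The sketch ranks by absolute coefficient only; the rule that drops negatively correlated generated metrics while preserving existing ones would be an extra filter between the two fits.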

4. Report

Every run emits (a) an aggregated metric as importable Python, (b) a Metric Card per generated metric, and (c) an interactive HTML Report Card with regression coefficients, correlation scatter, runtime distribution, per-example feedback, and p-value warnings when the selected metrics fail to reach statistical significance.
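The aggregated metric and the importance percentages derived from regression coefficients can be sketched as below. The exact formula behind the reported percentages is not given here, so normalizing absolute coefficients to sum to 100% is an assumption, and the function names and coefficients are invented for illustration.

```python
from typing import Dict


def relative_importance(coefs: Dict[str, float]) -> Dict[str, float]:
    """Percent share of total absolute coefficient mass (assumed formula)."""
    total = sum(abs(c) for c in coefs.values())
    if total == 0:
        return {name: 0.0 for name in coefs}
    return {name: 100.0 * abs(c) / total for name, c in coefs.items()}


def aggregated_score(z_scores: Dict[str, float],
                     coefs: Dict[str, float]) -> float:
    """Linear combination of z-normalized metric scores for one output."""
    return sum(coefs[name] * z_scores[name] for name in coefs)


# Made-up coefficients, for illustration only.
coefs = {"clarity": 0.6, "accuracy": 0.3, "length": -0.1}
importance = relative_importance(coefs)
score = aggregated_score({"clarity": 1.0, "accuracy": -0.5, "length": 2.0}, coefs)
print(importance, score)
```

Because the aggregate is a plain weighted sum, each metric's contribution to a given score can be read off term by term.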

Results

Kendall's τ with 95% confidence intervals over 5 independent runs.

Method	SimpEval (in-dist.)	HelpSteer2 (in-dist.)	EvalGen (out-of-dist.)	RealHumanEval (out-of-dist.)	CoGym (out-of-dist.)
Model-agnostic
Best Existing Metric	0.246 ± 0.00	0.327 ± 0.00	0.193 ± 0.00	0.138 ± 0.00	0.074 ± 0.00
MetaMetrics	0.127 ± 0.01	0.204 ± 0.00	−0.214 ± 0.01	0.025 ± 0.01	−0.119 ± 0.02
Finetuned LLM	0.076 ± 0.08	0.039 ± 0.03	0.054 ± 0.05	0.049 ± 0.06	0.223 ± 0.20
GPT-4o-mini backbone
LLM-Judge	0.272 ± 0.02	0.259 ± 0.01	0.161 ± 0.14	0.069 ± 0.01	0.199 ± 0.13
DnA Eval	0.234 ± 0.03	0.255 ± 0.02	0.174 ± 0.16	0.152 ± 0.01	0.185 ± 0.10
AutoMetrics (ours)	0.321 ± 0.04	0.324 ± 0.01	0.334 ± 0.06	0.160 ± 0.00	−0.034 ± 0.17
Qwen-3-32B backbone
LLM-Judge	0.294 ± 0.04	0.334 ± 0.02	0.272 ± 0.13	0.025 ± 0.01	0.276 ± 0.19
DnA Eval	0.042 ± 0.04	0.260 ± 0.02	0.232 ± 0.19	0.071 ± 0.15	0.353 ± 0.25
AutoMetrics (ours)	0.316 ± 0.02	0.342 ± 0.01	0.382 ± 0.05	0.145 ± 0.00	0.365 ± 0.08

AutoMetrics outperforms every baseline on all five tasks with Qwen-3-32B. With GPT-4o-mini it is the best method, or within the 95% confidence interval of the best method, on 4 of 5 tasks; the exception is CoGym. On EvalGen it improves Kendall correlation by 33.4% over the closest baseline (LLM-Judge).
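For readers unfamiliar with the statistic in the table, here is a small pure-Python sketch of Kendall's τ between metric scores and human ratings. It computes the tau-b variant, which corrects for ties; which variant the paper reports is an assumption, and the data are made up.

```python
import math
from itertools import combinations
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b via an O(n^2) scan over all pairs of examples."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: contributes to neither term
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    denom = math.sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denom if denom else 0.0


# Toy example: one swapped pair out of ten.
human = [1, 2, 3, 4, 5]
metric = [0.1, 0.4, 0.3, 0.8, 0.9]
tau = kendall_tau_b(metric, human)
print(tau)
```

Since τ depends only on pairwise orderings, a metric can score well without matching the scale of the human ratings.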

Sensitive and stable

We operationalize construct validity through two measurements:

Sensitivity: rate at which a metric assigns lower scores to degraded outputs.
Stability: degree to which scores are preserved under quality-neutral perturbations.

AutoMetrics detects negative perturbations in 81.0–97.8% of cases across tasks (well above the 50% baseline), and its stability under quality-preserving edits exceeds the baseline by more than a 95% confidence interval.
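The two measurements can be sketched as follows. Sensitivity follows the definition above directly. The stability formula (one minus the mean absolute score shift, normalized by the score range) is an assumption made for this sketch, since the exact definition is not given here; function names and data are hypothetical.

```python
from typing import Sequence


def sensitivity(original: Sequence[float], degraded: Sequence[float]) -> float:
    """Fraction of pairs where the degraded output scores strictly lower."""
    lower = sum(d < o for o, d in zip(original, degraded))
    return lower / len(original)


def stability(original: Sequence[float], perturbed: Sequence[float],
              score_range: float) -> float:
    """1 minus the mean absolute shift, as a fraction of the score range (assumed)."""
    mean_shift = sum(abs(o - p) for o, p in zip(original, perturbed)) / len(original)
    return 1.0 - mean_shift / score_range


# Toy scores on an assumed 1-5 scale (range of 4).
original = [4.0, 3.5, 5.0, 2.0]
degraded = [3.0, 3.6, 4.0, 1.0]    # after quality-lowering edits
perturbed = [4.0, 3.5, 4.5, 2.0]   # after quality-neutral edits
sens = sensitivity(original, degraded)
stab = stability(original, perturbed, score_range=4.0)
print(sens, stab)
```

A metric that ignores its input would score perfect stability and zero sensitivity, which is why the two are reported together.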

Sensitivity and stability of AutoMetrics on SimpEval, HelpSteer2, and CoGym, above a normal-distribution baseline.

Kendall correlation vs. training-set size on SimpEval, HelpSteer2, and RealHumanEval, for Generated Only and Full metric banks.

~80 examples saturate performance

We vary the training-set size from 5 up to 640 examples on three datasets and plot correlation on the full test set.

Performance levels off after about 80 samples across datasets and settings. At very low sample counts (10–20) on out-of-distribution tasks, Generated Only can outperform the full MetricBank — which is why AutoMetrics defaults to Generated Only below this threshold.

Case Study: Optimizing an Agent with AutoMetrics

AutoMetrics as a proxy reward for τ-Bench airline-assistance agent optimization.

Three metrics AutoMetrics produced for τ-Bench, with their regression coefficients. — AutoMetrics recommends three metrics for τ-Bench — two rubric-based and one single-criterion — with regression coefficients shown in yellow.

τ-Bench test accuracy over GEPA optimization rollouts, comparing AutoMetrics against a verifiable reward and an unoptimized baseline. — τ-Bench accuracy during DSPy GEPA optimization. AutoMetrics reaches 0.72 ± 0.06 after 2000 rollouts; verifiable reward reaches 0.68 ± 0.11; the unoptimized baseline is 0.60.

We split the 50 τ-airline tasks into 25 for training and 25 for evaluation. AutoMetrics runs in Generated Only mode and allocates extra budget for generation (10→20 single-criterion LLM-judge metrics; 5→8 rubric metrics). The final filtering step removes two metrics for having negative coefficients, leaving three. Using those three as the optimization signal for a DSPy + GEPA ReAct agent yields accuracy that exceeds the unoptimized baseline by a statistically significant margin (p < 0.05) and matches the gain from optimizing against a verifiable reward.

BibTeX

@inproceedings{ryan2026autometrics,
  title     = {AutoMetrics: Approximate Human Judgments with Automatically Generated Evaluators},
  author    = {Ryan, Michael J and Zhang, Yanzhe and Salunkhe, Amol and Chu, Yi and Xu, Di and Yang, Diyi},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=ymJuBifPUy}
}