RewardHarness hits 47.4% accuracy on image-edit evaluation from 100 examples, beats GPT-5
A new agentic reward framework learns to evaluate image edits from just 100 preference demonstrations and outperforms GPT-5 by 5.3 points, evolving tools and reasoning chains instead of fine-tuning weights.

"Humans can infer evaluation criteria from a handful of examples, yet AI systems typically require hundreds of thousands of labeled comparisons," the gap that RewardHarness aims to close. Researchers have long faced this data-efficiency puzzle in reward modeling—and a new preprint suggests agentic reasoning over curated skill libraries can match or exceed models trained on orders of magnitude more labeled data.
RewardHarness, authored by Yuxuan Zhang, Penghui Du, Bo Li, Cong Wei, Junwen Miao, and Huaisong Zhang, treats reward modeling as context evolution rather than weight optimization. The framework has two parts. An Orchestrator maintains a library of tools and skills and selects the most relevant subset for each image-editing evaluation task. A frozen Sub-Agent then uses those tools to construct a reasoning chain that judges whether a candidate edit satisfies the editing directive, comparing the source image, the candidate edits, and the instruction. When a predicted judgment diverges from ground truth, the Orchestrator refines its library automatically, with no additional human annotation.
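The preprint's interfaces aren't reproduced here, so the following is only a minimal sketch of the loop as described. Every name in it (`EditExample`, `SkillLibrary`, `sub_agent_judge`, `evolve`) is hypothetical, and the Sub-Agent is stubbed out where the real system would call a frozen LLM:

```python
from dataclasses import dataclass, field


@dataclass
class EditExample:
    source_image: str      # handle for the original image
    candidate_edit: str    # handle for the edited image
    instruction: str       # the editing directive
    ground_truth: bool     # preference label: does the edit satisfy the directive?


@dataclass
class SkillLibrary:
    """Evolving text library of tools/skills; no model weights change."""
    skills: list[str] = field(default_factory=list)

    def select(self, instruction: str, k: int = 5) -> list[str]:
        # Toy relevance ranking: prefer skills that share words with
        # the instruction, then keep the top k.
        words = set(instruction.lower().split())
        ranked = sorted(self.skills,
                        key=lambda s: -len(words & set(s.lower().split())))
        return ranked[:k]

    def refine(self, ex: EditExample, verdict: bool) -> None:
        # On a wrong judgment, append a corrective skill entry.
        self.skills.append(
            f"For instructions like '{ex.instruction}', the correct verdict "
            f"was {ex.ground_truth}, not {verdict}."
        )


def sub_agent_judge(ex: EditExample, tools: list[str]) -> bool:
    # Stand-in for the frozen Sub-Agent: in the paper this is an LLM
    # that builds a reasoning chain over the selected tools. A fixed
    # verdict keeps the sketch runnable end to end.
    return True


def evolve(library: SkillLibrary, demos: list[EditExample]) -> None:
    # Context evolution: misjudged demonstrations update the library,
    # never the model.
    for ex in demos:
        tools = library.select(ex.instruction)
        verdict = sub_agent_judge(ex, tools)
        if verdict != ex.ground_truth:
            library.refine(ex, verdict)
```

The design point is that `evolve` mutates only a text library, which is why the method needs no gradient updates and no extra annotation rounds.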
Using only 0.05% of the EditReward preference dataset, roughly 100 preference demonstrations, RewardHarness reached 47.4% average accuracy on image-editing evaluation benchmarks, 5.3 percentage points ahead of GPT-5. When deployed as a reward signal for GRPO fine-tuning, the resulting RL-tuned models scored 3.52 on ImgEdit-Bench. The results suggest that agentic reasoning over a curated skill library can make reward modeling far more sample-efficient, and the approach may extend to domains beyond image editing.
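For context on the GRPO deployment, here is a generic sketch of how a binary judge like this could serve as the reward signal. GRPO replaces a learned critic with rewards normalized within each sampled group; both function names and the `judge` parameter are illustrative, not the paper's code:

```python
from statistics import mean, pstdev
from typing import Callable


def group_relative_advantages(rewards: list[float]) -> list[float]:
    # GRPO's core idea: no learned value model; each candidate's
    # advantage is its reward normalized against the group of
    # candidates sampled for the same prompt.
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0   # guard against all-equal rewards
    return [(r - mu) / sigma for r in rewards]


def rewards_from_judge(judge: Callable[[str], bool],
                       candidates: list[str]) -> list[float]:
    # The agentic judge's binary verdicts become the scalar rewards.
    return group_relative_advantages(
        [1.0 if judge(c) else 0.0 for c in candidates]
    )
```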