RubricEM-8B trains research agents with self-generated rubrics and meta-learning
A new meta-RL framework uses self-generated rubrics to structure long-horizon research tasks, outperforming comparable open models on four benchmarks.

RubricEM is a reinforcement learning framework that trains AI research agents—systems that plan, search, evaluate evidence, and write long-form reports—by treating rubrics as the shared interface for policy execution, judge feedback, and agent memory. The authors argue that research tasks push RL beyond verifiable rewards: outputs lack ground-truth answers, trajectories span many tool-augmented decisions, and standard post-training offers little mechanism for turning past attempts into reusable experience. The resulting 8-billion-parameter model, RubricEM-8B, achieves strong performance across four long-form research benchmarks.
The framework makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable rubric-grounded guidance for future attempts.
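The stagewise credit assignment can be pictured as computing group-relative advantages separately for each rubric-judged stage, so every stage contributes its own learning signal instead of one sparse end-of-trajectory reward. The sketch below is a minimal illustration in that spirit, not the authors' implementation; the stage names, score layout, and normalization are assumptions.

```python
import statistics

# Assumed stage decomposition (the paper's exact stage set may differ).
STAGES = ["plan", "gather", "review", "synthesize"]

def stagewise_advantages(group_scores):
    """group_scores: one dict per sampled trajectory in a GRPO group,
    mapping stage name -> rubric-judge score in [0, 1].

    Returns per-trajectory, per-stage advantages, normalized within the
    group at each stage (group-relative, as in GRPO, but stage by stage)."""
    advantages = []
    for traj in group_scores:
        adv = {}
        for stage in STAGES:
            scores = [t[stage] for t in group_scores]
            mean = statistics.mean(scores)
            std = statistics.pstdev(scores) or 1.0  # avoid division by zero
            adv[stage] = (traj[stage] - mean) / std
        advantages.append(adv)
    return advantages

# Toy group of two judged trajectories.
group = [
    {"plan": 0.8, "gather": 0.6, "review": 0.7, "synthesize": 0.9},
    {"plan": 0.4, "gather": 0.6, "review": 0.5, "synthesize": 0.3},
]
adv = stagewise_advantages(group)
```

Note that a stage where all trajectories score equally (here, "gather") yields zero advantage for everyone, while stages that discriminate between attempts produce nonzero, denser feedback along the trajectory.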
Benchmark results
On these four long-form research benchmarks, RubricEM-8B outperforms comparable open models and approaches proprietary deep-research systems. The paper also includes thorough analyses of which ingredients drive this performance. Authors Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, and Chun-Liang Li detail the stagewise decomposition and reflection mechanisms that let the model convert past attempts into reusable guidance for future research tasks.