Qwen 2.5 14B hits 80% HumanEval via self-correcting code training
A solo developer fine-tuned Qwen 2.5 base models on self-generated coding corrections, pushing the 14B version to 131/164 HumanEval problems without any human-written training data.

A MacBook and $3.50 of RunPod credits were enough to push Qwen 2.5 14B base to 131 out of 164 correct on HumanEval—an 80 percent pass rate—by training the model exclusively on its own coding mistakes. The method is straightforward: prompt the base model to invent a coding problem and write unit tests for it, generate multiple solution attempts, save pairs where one attempt passes and another fails, then fine-tune on those self-mined corrections. No human-written code examples. The Python interpreter was the only judge in the loop.
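The loop described above can be sketched in a few lines of Python. The problem, tests, and attempts below are stubbed strings standing in for base-model generations, and the helper names are illustrative, not taken from the project's code:

```python
def passes_tests(solution_code: str, test_code: str) -> bool:
    """Run model-written unit tests against one solution attempt.
    The Python interpreter is the only judge: any exception means fail."""
    namespace = {}
    try:
        exec(solution_code, namespace)  # define the candidate function
        exec(test_code, namespace)      # asserts raise on failure
        return True
    except Exception:
        return False

def mine_correction_pairs(problem: str, test_code: str, attempts: list[str]):
    """Keep (failing, passing) attempt pairs as fine-tuning examples.
    In the real pipeline both problems and attempts come from the base
    model; here everything is stubbed."""
    passed = [a for a in attempts if passes_tests(a, test_code)]
    failed = [a for a in attempts if not passes_tests(a, test_code)]
    return [
        {"problem": problem, "bad": bad, "good": good}
        for bad in failed for good in passed
    ]

# Stubbed example: one buggy and one correct attempt at the same problem.
problem = "Write add(a, b) returning the sum of two numbers."
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
attempts = [
    "def add(a, b):\n    return a - b",  # fails the tests
    "def add(a, b):\n    return a + b",  # passes
]
pairs = mine_correction_pairs(problem, tests, attempts)
```

The saved pairs then become the fine-tuning corpus: the failing attempt plus the passing one for the same self-invented problem, with no human-written code anywhere in the data.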
The first attempt on Qwen 2.5 7B base appeared to fail: the model's score dropped from 25 to 2 correct after training. The culprit was a grader bug that truncated function outputs mid-stream, scoring the incomplete completions as wrong. Once fixed, the 7B model jumped to 112 correct, an 87-problem gain. The 14B version followed the same recipe with 100 self-generated pairs and a 95-minute H100 run, landing at 131 correct: four points shy of Qwen's own RLHF-tuned 14B Instruct release, and ahead of GPT-3.5 on math benchmarks.
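A toy illustration of why that bug was so damaging: cutting a completion mid-stream turns a correct solution into code that cannot even compile, so an exec-based grader scores it as a failure. This is a hypothetical reconstruction of the failure mode, not the author's actual grader:

```python
def grade(code: str, tests: str) -> bool:
    """Score a completion by executing it with the model's own tests."""
    ns = {}
    try:
        exec(code, ns)
        exec(tests, ns)
        return True
    except Exception:
        return False

full = "def add(a, b):\n    return a + b"
truncated = full[: len(full) // 2]  # simulate a mid-stream cut
tests = "assert add(2, 3) == 5"

grade(full, tests)       # the complete solution passes
grade(truncated, tests)  # the cut-off half raises SyntaxError, so it fails
```

With scores this sensitive to a single harness bug, the 25-to-2 collapse said nothing about the training itself, which is why fixing the grader alone revealed the 87-problem gain.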
The approach mirrors reasoning in the DeepSeek-R1 paper about models improving through verifiable rewards. Here the reward signal is pass-or-fail from a test suite the model wrote for itself. Coding problems have ground truth: the code either runs clean or throws an error. The researcher shared training logs and grader code, noting the method works because of that binary feedback. The 14B checkpoint is not yet public, but reproduction steps are available for anyone with a 24GB GPU.
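That binary feedback can be made robust by running each attempt in a separate interpreter process, so an infinite loop or crash in generated code cannot take down the harness. A minimal sketch, assuming a simple file-per-attempt layout (this is an illustrative helper, not the author's published grader):

```python
import os
import subprocess
import sys
import tempfile

def binary_reward(solution_code: str, test_code: str,
                  timeout_s: float = 5.0) -> int:
    """Pass-or-fail reward from the interpreter: 1 if the model-written
    tests run cleanly against the attempt, else 0. A child process plus
    a timeout guards against hangs in generated code."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".py", delete=False
    ) as f:
        f.write(solution_code + "\n" + test_code + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout_s,
        )
        # Exit code 0 means every assert passed; anything else is a fail.
        return 1 if result.returncode == 0 else 0
    except subprocess.TimeoutExpired:
        return 0
    finally:
        os.remove(path)
```

For example, `binary_reward("def f(x):\n    return x * 2", "assert f(2) == 4")` yields 1, while a wrong implementation yields 0; that single bit is the entire reward signal the method relies on.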