Reflective Prompt Tuning boosts LLM reasoning by up to 12.9 points without retraining
Researchers propose RPT, a framework that uses LLM function calls to diagnose failure patterns and revise prompts iteratively, improving reasoning task performance by up to 12.9 points without parameter updates.

Reflective Prompt Tuning (RPT) is a prompt optimization framework that automates the trial-and-error work of prompt engineering by treating an LLM as both optimizer and diagnostic engine. Instead of searching over candidate prompts or running fixed critique loops on individual examples, RPT calls a diagnostic function that evaluates the target model across an entire optimization set, identifies recurring failure modes, and returns a structured report. The optimizer LLM reads that report, together with a memory of prior diagnostic cycles, and revises the prompt for the next iteration, much as a human prompt engineer would debug systematically.
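
The diagnose-then-revise cycle described above can be sketched roughly as follows. This is an illustrative outline, not the authors' implementation: the names `DiagnosticReport`, `diagnose`, and `reflective_prompt_tuning` are hypothetical, as are the callables `target_model`, `label_failure`, and `revise`, which stand in for the target LLM, the failure-pattern analysis, and the optimizer LLM. Exact-match scoring and keeping the highest-accuracy prompt are also assumptions made for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (question, gold answer)


@dataclass
class DiagnosticReport:
    """Aggregate view of one evaluation pass over the optimization set."""
    accuracy: float
    failure_counts: Dict[str, int]  # failure label -> number of examples


History = List[Tuple[str, DiagnosticReport]]  # (prompt, report) per cycle


def diagnose(
    prompt: str,
    examples: Sequence[Example],
    target_model: Callable[[str, str], str],       # hypothetical: (prompt, question) -> answer
    label_failure: Callable[[str, str, str], str],  # hypothetical: (question, answer, gold) -> label
) -> DiagnosticReport:
    """Run the target model on every example and summarize recurring failures."""
    if not examples:
        raise ValueError("optimization set must not be empty")
    failures: Counter = Counter()
    correct = 0
    for question, gold in examples:
        answer = target_model(prompt, question)
        if answer == gold:  # assumption: exact-match scoring
            correct += 1
        else:
            failures[label_failure(question, answer, gold)] += 1
    return DiagnosticReport(correct / len(examples), dict(failures))


def reflective_prompt_tuning(
    initial_prompt: str,
    examples: Sequence[Example],
    target_model: Callable[[str, str], str],
    label_failure: Callable[[str, str, str], str],
    revise: Callable[[str, DiagnosticReport, History], str],  # hypothetical optimizer LLM
    rounds: int = 3,
) -> Tuple[str, History]:
    """Alternate diagnosis and revision; return the best prompt and the memory."""
    prompt = initial_prompt
    memory: History = []
    best_prompt, best_accuracy = prompt, -1.0
    for step in range(rounds):
        report = diagnose(prompt, examples, target_model, label_failure)
        memory.append((prompt, report))  # memory of prior diagnostic cycles
        if report.accuracy > best_accuracy:
            best_prompt, best_accuracy = prompt, report.accuracy
        if step < rounds - 1:
            # The optimizer sees the current report plus all earlier cycles.
            prompt = revise(prompt, report, memory)
    return best_prompt, memory
```

In a real setup, `target_model` and `revise` would wrap LLM calls; here they are left as plain callables so the control flow can be exercised with simple stubs.
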
Across three reasoning tasks, RPT improves over baseline prompts by up to 12.9 points, with the largest gains on multi-hop and mathematical reasoning. The framework integrates confidence-aware optimization by using calibration signals in both diagnostic feedback and final prompt selection, leading to better-calibrated outputs alongside higher accuracy. RPT remains competitive with state-of-the-art automated prompt methods while requiring no parameter updates and preserving inference-time flexibility. The diagnostic-report structure allows the optimizer to make targeted edits, such as adding constraints, reordering instructions, or clarifying ambiguous phrasing, that address the specific failure patterns surfaced in each cycle.
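
To make the idea of calibration-aware prompt selection concrete, here is a minimal sketch. The summary above does not say which calibration signal RPT uses or how it is combined with accuracy, so both choices here are assumptions: expected calibration error (ECE) with equal-width bins as the signal, and "accuracy minus a weighted ECE penalty" as the selection score. The function names and the `penalty` parameter are hypothetical.

```python
from typing import List, Sequence, Tuple

# (prompt, per-example confidences in [0, 1], per-example correctness flags)
Candidate = Tuple[str, Sequence[float], Sequence[bool]]


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted mean gap between confidence and accuracy across equal-width bins."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("confidences and correct must be non-empty and equal length")
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so confidences of exactly 0.0 or 1.0 land in a valid bin.
        index = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_confidence)
    return ece


def select_prompt(candidates: Sequence[Candidate], penalty: float = 0.5) -> str:
    """Pick the prompt with the best accuracy after an assumed ECE penalty."""
    def score(candidate: Candidate) -> float:
        _, confidences, correct = candidate
        accuracy = sum(1 for ok in correct if ok) / len(correct)
        return accuracy - penalty * expected_calibration_error(confidences, correct)

    return max(candidates, key=score)[0]
```

Under this scoring rule, two prompts with the same accuracy are separated by how closely their stated confidence tracks that accuracy, which is one simple way to prefer the better-calibrated prompt at selection time.
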


