Dynamic Latent Routing outperforms supervised fine-tuning by 6.6 points on low-data tasks
A new language-model post-training method learns discrete routing policies through dynamic search, outperforming standard fine-tuning across six models when training data is scarce.

Dynamic Latent Routing (DLR) is a language-model post-training method that learns discrete latent codes and routing policies through dynamic search in a single training stage. Researchers Fangyuan Yu, Xin Su, and Amir Abdullah draw on temporal policy composition in Markov Decision Processes, using General Dijkstra Search (GDS) to recover optimal goal-reaching policies by chaining intermediate sub-policies.
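The chaining idea can be illustrated with an ordinary Dijkstra search over a toy goal graph whose edges are labeled by sub-policies. This is a minimal sketch under assumed structure, not the paper's GDS implementation: the graph, the `pi_*` sub-policy names, and `chain_subpolicies` are invented for illustration.

```python
import heapq

def chain_subpolicies(graph, start, goal):
    """Dijkstra over a goal graph whose edges are (next_state, sub_policy, cost).

    Returns (total_cost, [sub_policy, ...]) for the cheapest chain of
    sub-policies from start to goal, or (inf, []) if the goal is unreachable.
    """
    # Priority queue of (cost-so-far, state, chain of sub-policies taken).
    frontier = [(0.0, start, [])]
    best = {}  # state -> lowest cost at which it was expanded
    while frontier:
        cost, state, chain = heapq.heappop(frontier)
        if state == goal:
            return cost, chain
        if best.get(state, float("inf")) <= cost:
            continue  # already expanded via a cheaper chain
        best[state] = cost
        for nxt, policy, weight in graph.get(state, []):
            heapq.heappush(frontier, (cost + weight, nxt, chain + [policy]))
    return float("inf"), []

# Toy abstraction of an MDP: states A..D, edges labeled by sub-policy names.
graph = {
    "A": [("B", "pi_ab", 1.0), ("C", "pi_ac", 4.0)],
    "B": [("C", "pi_bc", 1.0), ("D", "pi_bd", 5.0)],
    "C": [("D", "pi_cd", 1.0)],
}
cost, chain = chain_subpolicies(graph, "A", "D")
# The optimal goal-reaching policy is recovered as a chain of sub-policies.
```

The point of the sketch is the output type: the search returns not just a cost but an ordered chain of intermediate sub-policies, which is the composition the article describes.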
The method addresses a persistent problem in low-data fine-tuning: prior discrete-latent baselines consistently underperform supervised fine-tuning (SFT). DLR matches or exceeds SFT across four datasets and six models, achieving a mean gain of 6.6 percentage points, a meaningful margin when training examples are limited.
Benchmark results
DLR's joint learning of codes, routing policies, and model parameters in one pass contrasts with staged approaches that learn latent representations first and then fine-tune. The preprint reports results across six language models and four datasets, all in low-data regimes where supervised fine-tuning typically dominates. Mechanistic analyses and targeted code ablations show that the learned routing behaviors have distinct causal roles, suggesting the method discovers structured decision paths rather than memorizing patterns.
The "search, select, update" principle underlying GDS translates into a training loop that continuously refines which latent codes to use and how to route through them. This post-training procedure holds its advantage over SFT when labeled data is scarce, a common constraint in domain-specific fine-tuning and rapid prototyping. The ablations isolate each component's contribution, confirming that the routing mechanism itself drives the performance gap.
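A rough, toy-scale illustration of a "search, select, update" loop follows; this is not the paper's training procedure, and `train_dlr_sketch`, the reward function, and the preference table are all hypothetical stand-ins for scoring and updating a routing policy.

```python
import random

def train_dlr_sketch(examples, codes, reward, lr=0.1, steps=500, seed=0):
    """Toy 'search, select, update' loop (illustrative only).

    Learns a per-(example-type, code) routing preference: search scores
    every candidate code, select routes through the best one (with a
    little exploration), update moves the preference toward the reward.
    """
    rng = random.Random(seed)
    pref = {(x, c): 0.0 for x in examples for c in codes}  # routing table
    for _ in range(steps):
        x = rng.choice(examples)
        # search: score each candidate latent code for this example
        ranked = sorted(codes, key=lambda c: pref[(x, c)], reverse=True)
        # select: mostly exploit the best-scoring code, sometimes explore
        c = ranked[0] if rng.random() > 0.2 else rng.choice(codes)
        # update: refine the routing preference toward the observed reward
        pref[(x, c)] += lr * (reward(x, c) - pref[(x, c)])
    return pref

# Toy task: each example type has exactly one code that yields reward 1.0.
examples, codes = ["qa", "math"], ["code_a", "code_b"]
good = {("qa", "code_a"), ("math", "code_b")}
reward = lambda x, c: 1.0 if (x, c) in good else 0.0
pref = train_dlr_sketch(examples, codes, reward)
```

After training, the preference table routes each example type through its rewarding code, a minimal analogue of refining "which latent codes to use and how to route through them."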