ConRetroBert reaches 75.4% top-1 accuracy on retrosynthesis by reframing template prediction as retrieval
Dual-encoder framework treats template-based retrosynthesis as dense retrieval plus listwise ranking, outperforming prior template methods on single-step reaction prediction.
ConRetroBert, a dual-encoder framework for template-based retrosynthesis, reframes the task as dense product-template retrieval followed by candidate set ranking rather than global classification over a long-tailed rule library. The approach reaches 75.4% top-1 accuracy when fine-tuned from a leakage-controlled USPTO-Full checkpoint on the USPTO-50k benchmark, a substantial improvement over the 50.5% baseline achieved by contrastive pretraining alone.
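The retrieval reframing can be illustrated with a minimal sketch. The function name, embedding shapes, and scoring scheme here are illustrative assumptions, not the paper's code: a product embedding is scored against a bank of template embeddings by cosine similarity, and the top-k templates become the candidate set.

```python
import numpy as np

def retrieve_templates(product_emb, template_bank, k=5):
    """Rank reaction templates by cosine similarity to a product embedding.

    product_emb: (d,) embedding of the product molecule.
    template_bank: (n_templates, d) embeddings of all templates.
    Returns indices of the k highest-scoring templates, best first.
    """
    p = product_emb / np.linalg.norm(product_emb)
    t = template_bank / np.linalg.norm(template_bank, axis=1, keepdims=True)
    scores = t @ p                       # cosine similarity per template
    return np.argsort(-scores)[:k]      # descending by score
```

In this framing, rare templates need only a good embedding to be retrievable, rather than enough training examples to win a softmax over the full rule library.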
The framework operates in two stages. Stage 1 uses contrastive pretraining to learn a joint embedding space between product molecules and reaction templates. Stage 2 refines template ranking over hard-negative candidate sets mined from the retrieval bank, using a multi-positive listwise loss. To prevent instability during hard-negative mining, the system maintains a slow-moving exponential moving average (EMA) template encoder for retrieval bank construction while updating a live template encoder through the ranking loss. This EMA stabilization adds 1.1 percentage points to top-1 accuracy, lifting the Stage 2 result from 61.3% to 62.4% on USPTO-50k.
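The two Stage 2 ingredients described above can be sketched as follows; the parameter dictionaries, decay value, and function names are illustrative assumptions rather than the authors' implementation. The EMA encoder's weights are a slow blend of the live encoder's, and the multi-positive listwise loss maximizes the total softmax mass assigned to every template marked correct, not just the one recorded label.

```python
import numpy as np

def ema_update(ema_params, live_params, decay=0.999):
    """Blend live encoder weights into the slow-moving EMA copy used to
    build the retrieval bank (hypothetical dicts of weight arrays)."""
    return {name: decay * ema_params[name] + (1.0 - decay) * live_params[name]
            for name in ema_params}

def multi_positive_listwise_loss(scores, positive_mask):
    """Listwise loss over one product's mined candidate set.

    scores: (n_candidates,) similarity of the product to each candidate.
    positive_mask: boolean array, True for every template that yields a
    correct reactant set.
    Returns -log of the total softmax probability on the positives.
    """
    scores = scores - scores.max()               # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[positive_mask].sum())
```

Because only the EMA copy feeds the retrieval bank, the hard negatives mined for each product shift slowly even as the live encoder is updated aggressively by the ranking loss.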
Strengths in the long tail
Retrieval-based template prediction proves particularly effective for rare templates, where global classification typically struggles. The authors also observe that many correct reactant predictions arise from alternative explicit templates rather than only the recorded positive label in the training data. This suggests that template-based methods preserve valuable chemical reasoning that template-free approaches may obscure. The work argues that template-based methods are not inherently less competitive than template-free models; the weakness lies in the learning formulation. By reframing the task as dense retrieval plus ranking, ConRetroBert narrows the gap while preserving the traceability of each prediction to an explicit chemical transformation rule.
Code and data are available on GitHub. The preprint appears on arXiv.
