FeatCal lifts merged CLIP models to 85.5% accuracy with closed-form calibration
FeatCal is a post-merging calibration method that narrows the performance gap between merged models and task experts using layer-by-layer weight updates, reaching 85.5% average accuracy on CLIP-ViT-B/32 benchmarks with 256 examples per task in 53 seconds.

FeatCal, a post-merging calibration method from researchers Yanggan Gu, Shuo Cai, Zihao Wang, Wenjun Wang, Yuanyi Wang, and Pengkai Wang, targets the performance gap between merged models and individual task experts. The technique uses a small calibration dataset to adjust merged-model weights layer by layer in forward order, reaching 85.5% average accuracy on CLIP-ViT-B/32 Task Arithmetic benchmarks and 85.2% on FLAN-T5-base GLUE, outperforming the Surgery (77.0% and 83.7%) and ProbSurgery (78.8% and 82.2%) baselines.
Model merging combines multiple task-specific expert models into a single network without joint training or retraining, but the merged result typically underperforms the original experts. The authors frame this degradation as "feature drift" — the difference between features the merged model produces and those the expert would produce on identical input. Their theory decomposes drift into upstream propagation and local mismatch, tracking how errors compound through later layers and ultimately shift output predictions. FeatCal calibrates weights to reduce this drift while staying close to the merged initialization, preserving the efficiency benefits of merging.
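
One way to make that decomposition concrete is an add-and-subtract identity on per-layer features. The notation below (superscripts m for merged, e for expert, f_ℓ for the layer-ℓ map, h_ℓ for its output features) is assumed for illustration and may differ from the paper's:

```latex
% Drift at layer l, split by adding and subtracting f_l^m(h_{l-1}^e):
\delta_\ell = f_\ell^{m}\!\left(h_{\ell-1}^{m}\right) - f_\ell^{e}\!\left(h_{\ell-1}^{e}\right)
            = \underbrace{f_\ell^{m}\!\left(h_{\ell-1}^{m}\right) - f_\ell^{m}\!\left(h_{\ell-1}^{e}\right)}_{\text{upstream propagation}}
            + \underbrace{f_\ell^{m}\!\left(h_{\ell-1}^{e}\right) - f_\ell^{e}\!\left(h_{\ell-1}^{e}\right)}_{\text{local mismatch}}
```

The first term vanishes when earlier layers have already been calibrated, which is one reason a forward-order, layer-by-layer update schedule makes sense.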
The method uses a closed-form solution to update weights, requiring no gradient descent, iterative optimization, or additional modules. On CLIP-ViT-B/32, FeatCal achieves 82.9% accuracy with just 8 examples per task, and the full 256-example calibration completes in 53 seconds — roughly 4× faster than Surgery and ProbSurgery.
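
As a minimal sketch of what a closed-form, forward-order calibration pass could look like, the code below updates each linear layer with a ridge-style least squares that maps the merged model's incoming activations onto the expert's target activations while penalizing movement away from the merged weights. The names `calibrate_layer` and `calibrate_model`, the `lam` regularizer, and the activation bookkeeping are illustrative assumptions, not the paper's exact algorithm:

```python
import torch

def calibrate_layer(W_merged, X, Y, lam=1.0):
    """Closed-form ridge update for one linear layer (illustrative).

    W_merged: (d_out, d_in) weight from the merged model
    X:        (n, d_in)  activations entering this layer on calibration data
    Y:        (n, d_out) activations the task expert produces at this layer

    Solves  min_W ||X W^T - Y||_F^2 + lam * ||W - W_merged||_F^2,
    whose unique minimizer is
        W = (Y^T X + lam * W_merged) (X^T X + lam * I)^{-1}.
    """
    d_in = W_merged.shape[1]
    gram = X.T @ X + lam * torch.eye(d_in, dtype=X.dtype, device=X.device)
    rhs = Y.T @ X + lam * W_merged
    # gram is symmetric positive definite: solve rather than invert.
    return torch.linalg.solve(gram, rhs.T).T

def calibrate_model(linear_layers, expert_acts, calib_inputs, lam=1.0):
    """Forward-order pass: each layer is recalibrated on activations that
    have already flowed through the updated earlier layers, so upstream
    fixes propagate forward. Biases are left untouched for brevity."""
    h = calib_inputs                                  # (n, d_0)
    for layer, Y in zip(linear_layers, expert_acts):
        layer.weight.data = calibrate_layer(layer.weight.data, h, Y, lam)
        h = layer(h)                                  # next layer's input
    return linear_layers
```

The `lam` term keeps each solution near the merged initialization, matching the paper's stated goal of reducing drift without abandoning the merged weights, and the single linear solve per layer is consistent with the reported absence of gradient descent or iterative optimization.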
The next step is seeing whether FeatCal scales to larger merged models. The paper demonstrates ViT-B/32 and T5-base, but practitioners merging 7B or 70B parameter models will want calibration times and sample requirements at that scale. If the closed-form approach holds its efficiency at higher parameter counts, it could become the default post-merge tuning pass for multi-task deployments.