WARP recovers training data mixtures from model weights alone
New framework reconstructs the domain proportions of fine-tuned models by interpolating in weight space, reaching a mean absolute error of 0.046 on BERT without access to training logs.

WARP, a framework from researchers at Stanford and UC Berkeley, reconstructs the training data recipe of a fine-tuned model using only its public weights. Foundation models are released openly, but the domain mixture that shaped their training (how much web text, code, or academic papers went into the final dataset) remains undisclosed. WARP sidesteps the need for training logs by interpolating between the base model and the fine-tuned checkpoint through model merging, generating pseudo-checkpoints that approximate the missing training trajectory. These synthetic snapshots expose a geometric signature in weight space that correlates with the underlying data distribution.
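
To make the pseudo-checkpoint idea concrete, here is a minimal sketch that assumes the merge is a plain linear interpolation between base and fine-tuned weights. The article does not say which merging method WARP uses, so the function name `interpolate_checkpoints`, the `num_steps` parameter, and the toy weight dictionaries are all illustrative assumptions rather than the authors' code.

```python
import numpy as np

def interpolate_checkpoints(base, tuned, num_steps=5):
    """Build pseudo-checkpoints on the straight line from base to tuned weights.

    Assumption: simple linear interpolation; the merging method WARP
    actually uses may differ.
    """
    alphas = np.linspace(0.0, 1.0, num_steps)
    return [
        {name: (1.0 - a) * base[name] + a * tuned[name] for name in base}
        for a in alphas
    ]

# Toy stand-ins for real model weights (hypothetical values).
base = {"w": np.zeros(3)}
tuned = {"w": np.array([1.0, 2.0, 3.0])}

checkpoints = interpolate_checkpoints(base, tuned, num_steps=5)
for step, ckpt in enumerate(checkpoints):
    print(step, ckpt["w"])
```

The first pseudo-checkpoint equals the base model and the last equals the fine-tuned model; the ones in between stand in for the training trajectory that was never released.
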
The framework extracts geometric features from the interpolated checkpoints and maps them to domain proportions using either a parameter-free softmax readout or a small MLP projector trained on synthetic mixtures. In controlled experiments, WARP recovered domain mixtures for BERT with a mean absolute error (MAE) of 0.046 and for GPT-2 with an MAE of 0.104, outperforming membership inference baselines and even a variant with access to the true training trajectory. Unlike membership inference, which operates at the sample level, WARP characterizes the global composition of the training corpus. That capability could help auditors, researchers, and regulators understand what data shaped a released model without requiring access to proprietary training logs.
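
As a rough illustration of the softmax readout and the MAE metric, the sketch below turns one score per domain into proportions that sum to 1 and then measures the error against a known mixture. The scores and the reference mixture are made-up numbers, and how WARP derives per-domain scores from its geometric features is not described here, so treat `softmax_readout` and its inputs as assumptions.

```python
import numpy as np

def softmax_readout(scores):
    """Map one score per domain to proportions that sum to 1 (no learned parameters)."""
    z = np.asarray(scores, dtype=float)
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual domain proportions."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))

# Hypothetical per-domain scores (web text, code, academic papers).
scores = [2.0, 1.0, 0.5]
estimated = softmax_readout(scores)

# Hypothetical reference mixture, used only to show how MAE is computed.
reference = [0.6, 0.3, 0.1]
mae = mean_absolute_error(estimated, reference)
print(estimated, mae)
```

An MAE of 0.046 on this scale would mean the estimated share of each domain is off by about 4.6 percentage points on average.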



