O'Prior boosts tabular model accuracy by injecting realistic data irregularities into pretraining
New preprint isolates synthetic task distribution as the primary driver of tabular foundation model quality, showing that realism-injected priors outperform standard synthetic data when downstream tasks include confounding and missingness.
O'Prior, a compositional synthetic pretraining distribution introduced in a new arXiv preprint, addresses a fundamental gap in tabular foundation models. Unlike language or vision models, which learn from real-world corpora, tabular models acquire their inductive biases almost entirely from synthetic task distributions, yet those distributions are typically too clean to prepare a model for the irregularities it meets in deployment.
Standard synthetic priors generate well-behaved tasks (smooth functions, complete observations, balanced classes) that omit the confounding, missingness, and support-query mismatch common in real tabular data. O'Prior's design introduces four coupled components: a hierarchical structural causal model (SCM) meta-generator spanning diverse functional families; a modular realism engine covering heterogeneous marginals, missingness patterns, and target transforms; an explicit stress module that injects confounding and distribution shift; and a curriculum-governed generation protocol designed to prevent leakage. The researchers held architecture, optimizer, and compute budget constant across experiments and varied only the synthetic prior, isolating prior design as the scientific variable.
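To make the first two components concrete, here is a minimal Python sketch of how one synthetic task might be sampled from a random causal graph and then roughened up. It is an illustration only, not the preprint's implementation: the function names `sample_scm_task` and `apply_realism`, the three-function mechanism family, the completely-at-random missingness, and every default parameter are assumptions made for this example.

```python
import numpy as np

def sample_scm_task(n_rows=200, n_features=5, rng=None):
    """Toy stand-in for an SCM meta-generator (hypothetical, not the paper's)."""
    rng = np.random.default_rng(rng)
    d = n_features + 1  # last node plays the role of the target
    # Strictly upper-triangular weights define a DAG in column order.
    weights = np.triu(rng.normal(size=(d, d)), k=1)
    weights *= rng.random((d, d)) < 0.5  # drop roughly half of the edges
    # A tiny "functional family" of mechanisms, chosen per node.
    mechanisms = [np.tanh, np.sin, lambda z: z]
    nodes = np.zeros((n_rows, d))
    for j in range(d):
        f = mechanisms[rng.integers(len(mechanisms))]
        parent_signal = nodes @ weights[:, j]  # only earlier nodes have weight
        nodes[:, j] = f(parent_signal) + rng.normal(scale=0.5, size=n_rows)
    return nodes[:, :-1], nodes[:, -1]

def apply_realism(X, y, missing_rate=0.2, rng=None):
    """Toy realism pass: mixed marginals, missing cells, a target transform.

    Assumes X has at least two columns.
    """
    rng = np.random.default_rng(rng)
    X = X.copy()
    X[:, 0] = np.exp(X[:, 0])                         # skewed, positive column
    X[:, 1] = np.digitize(X[:, 1], bins=[-0.5, 0.5])  # three-level ordinal column
    # Missing completely at random; a real engine would use richer patterns.
    X[rng.random(X.shape) < missing_rate] = np.nan
    y_cls = (y > np.median(y)).astype(int)            # regression -> binary labels
    return X, y_cls

X, y = sample_scm_task(rng=0)
X_real, y_real = apply_realism(X, y, rng=1)
```

Pretraining on such a prior would draw many tasks, each with a fresh seed. The stress module and the curriculum protocol are deliberately left out of this sketch.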
The paper reports consistent downstream gains on real tabular benchmarks, with improvements concentrated in regimes characterized by distributional irregularities. Ablations show that mechanism diversity, realism composition, and shift-aware stress each contribute independently, so their effects are not interchangeable.
What stands out
- Gains concentrate where data is irregular. O'Prior's advantage over standard synthetic priors is largest on downstream tasks with missing values, confounded features, or train-test distribution shift. On clean, well-behaved test sets, the gap narrows.
- Realism and stress are not interchangeable. The modular realism engine (heterogeneous marginals, missingness, target transforms) and the explicit stress module (confounding, support-query mismatch) each contribute independently. Combining them yields additive gains; dropping either one degrades performance in different ways.
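The two stress mechanisms named above, confounding and support-query mismatch, can be illustrated with a short Python sketch. This is a hypothetical example under simple assumptions, not the paper's stress module: the helper names `add_hidden_confounder` and `shifted_support_query_split`, the additive confounder, and the sort-based split are all choices made here for illustration.

```python
import numpy as np

def add_hidden_confounder(X, y, strength=1.0, rng=None):
    """Inject an unobserved variable that drives both a feature and the target."""
    rng = np.random.default_rng(rng)
    u = rng.normal(size=len(y))       # never returned, so the model cannot see it
    X = X.copy()
    X[:, 0] = X[:, 0] + strength * u  # u moves the first feature...
    y = y + strength * u              # ...and the target, creating spurious association
    return X, y

def shifted_support_query_split(X, y, shift_col=0, support_frac=0.7):
    """Split so support rows have low values of one column and query rows high ones."""
    order = np.argsort(X[:, shift_col])
    cut = int(round(support_frac * len(y)))
    support, query = order[:cut], order[cut:]
    return (X[support], y[support]), (X[query], y[query])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X.sum(axis=1)
X_conf, y_conf = add_hidden_confounder(X, y, strength=2.0, rng=1)
(support_X, support_y), (query_X, query_y) = shifted_support_query_split(X_conf, y_conf)
```

A model given `support_X` as context and asked to predict on `query_X` faces covariate shift on the split column, which is the kind of mismatch a clean i.i.d. split never exposes.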
