Transformer framework handles missing medical data without imputation
A new architecture fuses vision and tabular clinical data through masked self-attention, maintaining performance when entire modalities or individual features drop out during training or inference.
A transformer framework designed to handle missing medical imaging or structured clinical data without imputation or model switching has been detailed in a preprint posted to arXiv on May 13. The architecture integrates three encoders—vision, tabular, and multimodal fusion—and uses learnable modality tokens to weight unimodal representations before merging them via intermediate fusion with masked self-attention. When a modality or feature is absent, the masking mechanism excludes the corresponding tokens from both information aggregation and gradient propagation, letting the model operate on whatever subset of data is actually present.
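The preprint's exact formulation is not public, but the core masking idea can be sketched in a few lines. The following is a minimal, hypothetical NumPy illustration (not the authors' implementation): tokens belonging to an absent modality are given attention logits of negative infinity as keys, so after the softmax they receive zero weight and contribute nothing to the aggregated representations — and, in a differentiable framework, would receive no gradient through those attention weights.

```python
import numpy as np

def masked_self_attention(x, present):
    """Single-head self-attention that excludes absent tokens.

    x: (n_tokens, d) token embeddings (e.g., vision + tabular tokens).
    present: (n_tokens,) boolean availability mask; at least one True.

    Absent tokens are masked out as attention *keys*, so present tokens
    aggregate information only from what is actually observed.
    """
    d = x.shape[1]
    logits = (x @ x.T) / np.sqrt(d)       # (n, n) raw attention logits
    logits[:, ~present] = -np.inf         # absent tokens cannot be attended to
    # numerically stable softmax over the key dimension
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x                    # masked columns carry zero weight
```

Because the masked columns are exactly zero after the softmax, the output for present tokens is identical no matter what values the absent tokens hold — which is what lets the model run on arbitrary subsets of modalities without imputation.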
The evaluation paired MIMIC-CXR chest radiographs with structured clinical variables from MIMIC-IV, targeting multilabel classification of 14 diagnostic findings with incomplete annotations. Two parallel stress-test protocols progressively increased training and inference missingness in each modality separately, spanning fully multimodal to fully unimodal scenarios. Across all regimes, the proposed method outperformed representative baselines, showing smoother performance degradation and improved robustness. A modality-dropout regularization strategy—stochastically removing available modalities during training—further encouraged the model to exploit complementary information under partial data availability. Ablation studies confirmed that attention-level masking and intermediate fusion with joint fine-tuning are key to resilient multimodal inference.
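Modality dropout of this kind is straightforward to express. The sketch below is an assumption-laden illustration, not the preprint's code: each available modality is dropped independently with some probability, while always keeping at least one so the forward pass has something to attend to.

```python
import numpy as np

def modality_dropout(present, p_drop=0.3, rng=None):
    """Stochastically mark available modalities as missing during training.

    present: dict mapping modality name -> bool availability,
             e.g. {"vision": True, "tabular": True}.
    p_drop:  per-modality drop probability (a hypothetical hyperparameter).

    At least one modality is always kept, so the model still receives
    some input and learns to rely on whichever modalities survive.
    """
    rng = rng or np.random.default_rng()
    out = dict(present)
    for m in [k for k, avail in out.items() if avail]:
        remaining = [k for k in out if out[k]]
        if len(remaining) > 1 and rng.random() < p_drop:
            out[m] = False  # this modality's tokens get masked this step
    return out
```

The resulting mask would feed directly into the attention-level masking: dropped modalities are simply treated as absent for that training step, which forces the fusion encoder to produce useful predictions from every surviving subset.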
Most existing multimodal medical AI assumes complete modality availability, an assumption that rarely holds in real-world clinical settings where entire imaging studies or lab panels are frequently missing. The preprint does not include open weights or code, so practitioners cannot yet replicate the architecture or test it on their own clinical datasets. The next step to watch is whether the authors release an implementation or pretrained checkpoints that let hospital ML teams adapt the framework to their own data streams, and whether the attention-masking approach generalizes beyond vision-tabular pairs to other heterogeneous medical modalities like genomics, time-series vitals, or free-text notes.
