TDV learns video embeddings without hand-tuned augmentations
Researchers at NYU and UIUC released Temporal Difference in Vision, a self-supervised framework that encodes frames and motion separately to train visual models on raw video without data augmentation.
Temporal Difference in Vision (TDV), a self-supervised learning framework from researchers at NYU and the University of Illinois Urbana-Champaign, trains visual models on video by splitting the task between two encoders: one for static frames, one for motion. Released on arXiv this week, the approach treats the next frame's latent state as the sum of the current frame embedding and a compressed motion vector. That additive structure replaces the crop-based augmentations and masking strategies that earlier methods like DINO and iBOT rely on.
The paper argues that the inductive biases imposed by augmentations (random crops, color jitter, multi-view consistency) become less useful as training data scales. TDV instead leans on temporal causality: the assumption that consecutive frames are related by a learnable motion delta. The team tested the claim on optical flow and stereo depth estimation, where TDV-trained encoders outperformed augmentation-heavy baselines without requiring any hand-tuned transformations during training.
What stands out
- 01 No augmentation pipeline. TDV learns directly from raw video frames. The temporal-difference objective provides enough signal to train both encoders without needing crop strategies, color shifts, or masked-patch reconstruction.
- 02 Additive motion encoding. The next-frame latent is modeled as z_{t+1} = z_t + Δz, where z_t comes from the frame encoder and Δz from the motion encoder. That linearity keeps the architecture simple and the motion representation compact.
- 03 Stronger on spatiotemporal tasks. On optical flow benchmarks, TDV encoders beat DINO and iBOT by measurable margins. The paper attributes this to preserving spatial structure that augmentation-based methods discard.
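The additive relation z_{t+1} = z_t + Δz can be illustrated with a toy NumPy sketch. This is not the paper's implementation: the linear "encoders", the choice to feed the motion encoder a raw pixel difference, the mean-squared-error loss, and every name and dimension below are assumptions made for illustration, since this summary does not describe the actual architecture or objective.

```python
import numpy as np

FRAME_DIM = 64   # size of a flattened toy "frame" (hypothetical)
LATENT_DIM = 16  # embedding size (hypothetical)

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for the frame and motion encoders.
W_frame = rng.normal(scale=0.1, size=(FRAME_DIM, LATENT_DIM))
W_motion = rng.normal(scale=0.1, size=(FRAME_DIM, LATENT_DIM))


def encode_frame(frame, w_frame):
    """Map a frame to its latent z_t."""
    return frame @ w_frame


def encode_motion(frame_t, frame_next, w_motion):
    """Map a frame pair to a motion delta.

    Assumption: the motion encoder sees the raw pixel difference.
    """
    return (frame_next - frame_t) @ w_motion


def predict_next_latent(frame_t, frame_next, w_frame, w_motion):
    """Additive prediction: z_{t+1} is approximated by z_t + delta_z."""
    z_t = encode_frame(frame_t, w_frame)
    delta_z = encode_motion(frame_t, frame_next, w_motion)
    return z_t + delta_z


def temporal_difference_loss(frame_t, frame_next, w_frame, w_motion):
    """Mean squared error between predicted and encoded next-frame latents.

    Assumption: MSE is used here only as a simple placeholder loss.
    """
    target = encode_frame(frame_next, w_frame)
    pred = predict_next_latent(frame_t, frame_next, w_frame, w_motion)
    return float(np.mean((pred - target) ** 2))


# A toy "video": 8 random frames, paired as (frame t, frame t+1).
frames = rng.normal(size=(8, FRAME_DIM))
frame_t, frame_next = frames[:-1], frames[1:]

loss = temporal_difference_loss(frame_t, frame_next, W_frame, W_motion)
print(f"toy temporal-difference loss: {loss:.4f}")
```

In this linear toy the loss vanishes when the motion weights equal the frame weights, which makes for an easy sanity check of the additive structure. Objectives that predict latents from latents generally need some safeguard against trivial constant solutions; how TDV handles that is not covered in this summary.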




