FaithfulFaces preserves facial identity across pose shifts and occlusions in video generation
A new framework from researchers at multiple institutions tackles identity distortion in text-to-video generation when faces rotate or become occluded, using a pose-shared dictionary and Euler angle embeddings.
FaithfulFaces, a pose-faithful identity preservation framework by Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Kai Yu, and Sen Liang, addresses a persistent problem in identity-preserving text-to-video generation: faces that warp or lose consistency when the camera angle changes or when occlusions occur. The system centers on a pose-shared identity aligner that maps single-view facial inputs into a global pose representation using explicit Euler angle embeddings. It then refines and aligns those poses across different views via a pose-shared dictionary and a pose-variation identity-invariance constraint. This pose-faithful prior guides the generative model toward outputs that hold identity constant even as the face rotates or parts of it disappear behind objects.
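The paper's exact Euler angle encoding has not been published, but the idea of mapping explicit head-pose angles into a fixed-length vector can be sketched with the sinusoidal scheme familiar from transformer positional encodings. Everything below (function name, dimensionality, frequency ladder) is an illustrative assumption, not the authors' implementation:

```python
import math

def euler_angle_embedding(yaw, pitch, roll, dim_per_angle=8):
    """Hypothetical sinusoidal embedding of head-pose Euler angles.

    Encodes each angle (in radians) at several frequencies so that
    nearby poses get nearby embeddings -- one plausible way to feed
    explicit pose into a global pose representation.
    """
    emb = []
    for angle in (yaw, pitch, roll):
        for k in range(dim_per_angle // 2):
            freq = 2.0 ** k  # geometric frequency ladder
            emb.append(math.sin(freq * angle))
            emb.append(math.cos(freq * angle))
    return emb

# A frontal face (all angles zero) maps to a fixed reference embedding.
frontal = euler_angle_embedding(0.0, 0.0, 0.0)
```

Because the embedding is a smooth function of the angles, a downstream aligner can interpolate between poses it has seen, which is one reason explicit angle inputs are attractive for large rotations.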
Identity-preserving text-to-video generation has become a focus area as creators demand consistent character appearance across generated sequences. Earlier methods often held up well in frontal or near-frontal shots but collapsed when a character turned profile or when another object crossed in front of the face. FaithfulFaces attacks that weakness by explicitly modeling pose variation as a learnable global representation rather than treating each frame's face as an independent input. The pose-shared dictionary acts as a reference library of facial orientations, letting the aligner recognize that a three-quarter view and a profile view belong to the same identity even when pixel-level features differ.
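One way to picture a "reference library of facial orientations" is as soft attention over a small set of shared pose atoms: every view, whatever its angle, is re-expressed as a convex combination of the same learned basis. The sketch below is a guess at that mechanism, not the paper's code; the atom count, scoring rule, and function name are all assumptions:

```python
import math

def dictionary_align(pose_query, dictionary):
    """Hypothetical soft lookup into a pose-shared dictionary.

    Scores the query pose vector against each shared atom, softmaxes
    the scores, and returns the weighted mix of atoms -- so different
    views of one identity land in a common pose basis.
    """
    scale = math.sqrt(len(pose_query))
    # scaled dot-product score against each dictionary atom
    scores = [sum(q * a for q, a in zip(pose_query, atom)) / scale
              for atom in dictionary]
    # numerically stable softmax over atoms
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # aligned pose = convex combination of the shared atoms
    return [sum(w * atom[i] for w, atom in zip(weights, dictionary))
            for i in range(len(pose_query))]
```

Under this reading, a three-quarter view and a profile view differ only in their mixing weights over the shared atoms, which is what lets an invariance constraint tie them to one identity.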
The team built a specialized video dataset featuring substantial facial pose diversity to train the system. That dataset choice reflects a broader trend in video generation research: curating training material that exercises edge cases—extreme angles, partial occlusions, rapid head motion—rather than relying on existing datasets that skew toward static or frontal compositions. Experiments show FaithfulFaces outperforms prior identity-preserving text-to-video methods on both identity consistency and structural clarity metrics when tested on sequences with large pose changes and occlusions. The preprint was posted May 13, 2026; no code or model weights have been released yet.
