Scoring mechanism alone drives 4× gap in authorship attribution accuracy
Mechanistic interpretability study finds that mean pooling versus late interaction determines where encoder-based language models consolidate authorship signal, explaining a 4× performance gap between otherwise identical fine-tuned models.

Authorship attribution models fine-tuned from the same pretrained encoder, trained on the same data with the same loss function, can exhibit a four-fold performance gap depending solely on their scoring mechanism. A new preprint from Francis Kulumba, Guillaume Vimont, Laurent Romary, and Florian Cafiero uses mechanistic interpretability tools to explain why.
According to the study, stylistic features (word length, punctuation density, function-word frequency) are equally available at every layer in every model tested, including an off-the-shelf control encoder, so representation quality is not the bottleneck. Instead, causal intervention experiments reveal that the scorer determines where the encoder consolidates authorship signal during training.
Gradient structure and layer depth
Mean pooling forces the encoder to consolidate authorship information by the early to middle layers, whereas late interaction defers consolidation to later layers. The authors derive this difference from the gradient structure of each scorer: mean pooling backpropagates a uniform signal across all tokens, while late interaction produces a sparse gradient that concentrates updates deeper in the network. Training dynamics confirm that the two scorers put otherwise identical models on distinct learning trajectories from the start.
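The uniform-versus-sparse contrast can be seen in a toy calculation. The sketch below is not the authors' code. It assumes that "late interaction" means a ColBERT-style MaxSim score (the sum, over query tokens, of the best dot product against any document token) and that "mean pooling" means the dot product of averaged token vectors. All function names and the toy vectors are hypothetical. It only illustrates how the gradient spreads across tokens at the scorer's input, and does not reproduce the layer-depth effect the preprint reports.

```python
# Toy comparison of scorer gradients with respect to document token vectors.
# Assumption: late interaction = ColBERT-style MaxSim; names are illustrative.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean_vec(tokens):
    # Average the token vectors dimension by dimension.
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def mean_pool_score(q, d):
    return dot(mean_vec(q), mean_vec(d))

def mean_pool_grad_doc(q, d):
    # d(score)/d(d_j) = mean(q) / len(d): the same vector for every token j.
    q_bar = mean_vec(q)
    return [[x / len(d) for x in q_bar] for _ in d]

def late_interaction_score(q, d):
    # Each query token contributes only its best-matching document token.
    return sum(max(dot(qi, dj) for dj in d) for qi in q)

def late_interaction_grad_doc(q, d):
    # Only the arg-max document token for each query token gets a gradient.
    grad = [[0.0] * len(d[0]) for _ in d]
    for qi in q:
        best = max(range(len(d)), key=lambda k: dot(qi, d[k]))
        grad[best] = [g + x for g, x in zip(grad[best], qi)]
    return grad

# Two query tokens, four document tokens, two dimensions (made-up values).
q = [[1.0, 0.0], [0.0, 1.0]]
d = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]

print("mean pooling grad:    ", mean_pool_grad_doc(q, d))
print("late interaction grad:", late_interaction_grad_doc(q, d))
```

With these toy inputs, mean pooling sends the same gradient vector to all four document tokens, while the MaxSim scorer updates only the two tokens that win a maximum and leaves the other two at zero.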
For practitioners building authorship classifiers or stylometric tools, the choice between mean pooling and late interaction is not a neutral hyperparameter. It effectively selects which layers will carry the signal—and that choice alone can swing accuracy by a factor of four.