Authorship attribution models fine-tuned from the same pretrained encoder, with the same data and loss, can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are equally available at every layer of every model, including an off-the-shelf control encoder, so the gap does not come from representation quality. Instead, causal intervention shows that the scorer determines where the encoder consolidates authorship signal: mean pooling forces consolidation by the early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from it.
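To make the two scoring mechanisms concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not give the exact formulas, so the sketch assumes cosine similarity over mean-pooled token embeddings for the first scorer and a MaxSim-style score (best match per token, averaged) for late interaction. The function names and the random token embeddings are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`, guarding against zero norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def mean_pool_score(a_tokens, b_tokens):
    """Assumed mean-pooling scorer: cosine similarity of the two mean token vectors.

    Each input has shape (num_tokens, hidden_dim).
    """
    a = l2_normalize(a_tokens.mean(axis=0))
    b = l2_normalize(b_tokens.mean(axis=0))
    return float(a @ b)


def late_interaction_score(a_tokens, b_tokens):
    """Assumed late-interaction scorer (MaxSim-style).

    For every token in `a_tokens`, take its best cosine match among
    `b_tokens`, then average those best matches.
    """
    a = l2_normalize(a_tokens)
    b = l2_normalize(b_tokens)
    sim = a @ b.T  # pairwise cosine similarities, shape (len_a, len_b)
    return float(sim.max(axis=1).mean())


# Hypothetical token embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
doc_a = rng.normal(size=(5, 8))
doc_b = rng.normal(size=(7, 8))

print("mean pooling    :", mean_pool_score(doc_a, doc_b))
print("late interaction:", late_interaction_score(doc_a, doc_b))
```

In this sketch the mean gives every token an equal weight of 1/n in the pooled vector, whereas the max passes signal only through the selected token pairs. That structural contrast is the kind of difference a gradient-based analysis of the scorers would work from, though the derivation itself is not reproduced here.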