Pre-trained language models (LMs) perform well in In-Topic setups, where training and testing data come from the same topics. However, they struggle in Cross-Topic scenarios, where the testing data come from topics distinct from those seen in training, such as Gun Control. This study analyzes various LMs with three probing-based experiments to shed light on the reasons behind the In- vs. Cross-Topic generalization gap. In doing so, we demonstrate, for the first time, that generalization gaps and the robustness of the embedding space vary significantly across LMs. Additionally, we assess larger LMs and underscore the relevance of our analysis for recent models. Overall, diverse pre-training objectives, architectural regularization, and data deduplication contribute to more robust LMs and diminish generalization gaps. Our research contributes to a deeper understanding and comparison of language models across different generalization scenarios.