In this study, we investigate how language models develop preferences for \textit{idiomatic} as opposed to merely \textit{linguistically acceptable} Swedish, both during pretraining and when adapting a model from English to Swedish. To do so, we train Swedish models both from scratch and by fine-tuning English-pretrained models, probing their preferences at various checkpoints using minimal pairs that differ in linguistic acceptability or idiomaticity. For linguistic acceptability, we adapt existing benchmarks into a minimal-pair format. To assess idiomaticity, we introduce two novel datasets: one contrasting conventionalized idioms with plausible variants, and another contrasting idiomatic Swedish with translationese. Our findings suggest that idiomatic competence emerges more slowly than other linguistic abilities, including grammatical and lexical correctness. While longer training yields diminishing returns on most tasks, idiom-related performance continues to improve, particularly in the largest model tested (8B). However, instruction tuning on data machine-translated from English -- the common approach for languages with little or no native instruction data -- causes models to rapidly lose their preference for idiomatic language.
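The minimal-pair probing described above can be sketched as follows. This is not the authors' code: it assumes the standard setup in which a model "prefers" a sentence when it assigns it a higher total log-probability, and it substitutes a toy unigram scorer (\texttt{log\_prob}) for a real language model's summed token log-probabilities. The function and variable names are illustrative.

```python
import math

def log_prob(sentence, unigram_logp):
    # Toy stand-in for an LM score: sum per-word log-probabilities,
    # with a small floor for out-of-vocabulary words. A real probe
    # would sum token log-probs from the model under evaluation.
    return sum(unigram_logp.get(w, math.log(1e-6)) for w in sentence.split())

def prefers_first(pair, unigram_logp):
    # A pair is (target, variant): the idiomatic / acceptable sentence
    # first, its minimally different counterpart second.
    target, variant = pair
    return log_prob(target, unigram_logp) > log_prob(variant, unigram_logp)

def accuracy(pairs, unigram_logp):
    # Fraction of minimal pairs where the model ranks the target
    # sentence above its variant.
    return sum(prefers_first(p, unigram_logp) for p in pairs) / len(pairs)
```

For example, a checkpoint that scores the conventional idiom "kasta in handduken" ("throw in the towel") above a plausible variant like "kasta in badlakanet" counts as preferring idiomatic Swedish on that pair; accuracy over many such pairs tracks how this preference develops during training.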