This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly common practice given the prevalence of powerful generative models. Rather than the usual emphasis on performance metrics, we focus on how this training methodology affects linguistic diversity, especially when applied recursively over time. To assess this, we adapt existing metrics and develop new ones targeting lexical, syntactic, and semantic diversity, and apply them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of model outputs across successive iterations, which is especially pronounced for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly for the preservation of linguistic richness. Our study highlights the need to carefully consider the long-term effects of such training approaches on the linguistic capabilities of language models.