This study investigates the consequences of training large language models (LLMs) on synthetic data generated by their predecessors, an increasingly prevalent practice aimed at addressing the limited supply of human-generated training data. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we developed a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive fine-tuning experiments across various natural language generation tasks. Our findings reveal a marked decrease in the diversity of the models' outputs through successive iterations. This trend underscores the potential risks of training LLMs on predecessor-generated text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of LLMs.
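The abstract does not spell out the metrics used in the study. As a rough illustration of what a lexical diversity measurement across training iterations could look like, the sketch below computes distinct-n (the share of unique n-grams in a set of outputs), a common stand-in that is not claimed to be the authors' metric. The function name `distinct_n`, the whitespace tokenization, and the toy outputs `gen0` and `gen1` are all assumptions made for this example.

```python
def distinct_n(texts, n=2):
    """Return the fraction of n-grams that are unique across `texts`.

    Illustrative stand-in for a lexical diversity metric; tokenization is
    plain lowercased whitespace splitting (an assumption of this sketch).
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Build n-grams by zipping n shifted views of the token list.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    # With no n-grams at all, report zero diversity rather than divide by zero.
    return len(unique) / total if total else 0.0


# Hypothetical outputs from an initial model and from a model fine-tuned
# on its predecessor's text (toy data, not results from the study).
gen0 = ["the cat sat on the mat", "a dog ran across the park"]
gen1 = ["the cat sat on the mat", "the cat sat on the rug"]

for name, outputs in [("gen0", gen0), ("gen1", gen1)]:
    print(name, round(distinct_n(outputs, n=2), 2))
```

On this toy data the score falls from the first set to the second because the later outputs reuse most of the same bigrams, which is the kind of decline the abstract describes. Syntactic and semantic diversity would need different tools (for example parse-structure or embedding-based comparisons) and are not covered by this sketch.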