Specifying all desirable properties of a language model is challenging, but certain requirements seem essential. Given samples from an unknown language, the trained model should produce valid strings not seen in training and be expressive enough to capture the language's full richness. Otherwise, outputting invalid strings constitutes "hallucination," and failing to capture the full range leads to "mode collapse." We ask whether a language model can meet both requirements. We investigate this within a statistical language generation setting that builds on the work of Gold and Angluin. Here, the model receives random samples from a distribution over an unknown language K, which belongs to a possibly infinite collection of languages. The goal is to generate unseen strings from K. We say the model generates from K with consistency and breadth if, as the training set size increases, its output converges to all unseen strings in K. Kleinberg and Mullainathan [KM24] asked whether consistency and breadth in language generation are possible. We answer this negatively: for a large class of language models, including next-token prediction models, this is impossible for most collections of candidate languages. This contrasts with the result of [KM24], who show that consistent generation without breadth is possible for any countable collection of languages. Our finding highlights that generation with breadth fundamentally differs from generation without breadth. As a byproduct, we establish near-tight bounds on the number of samples needed for generation with or without breadth. Finally, our results offer hope: consistent generation with breadth is achievable for any countable collection of languages when negative examples (strings outside K) are available alongside positive ones. This suggests that post-training feedback, which encodes negative examples, can be crucial in reducing hallucinations while limiting mode collapse.
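As a minimal formalization sketch of the two requirements (the notation $\mathcal{L}$, $S_n$, and $\mathcal{G}_n$ is ours, not taken from the paper): let $\mathcal{L} = \{L_1, L_2, \dots\}$ be the candidate collection, let $K \in \mathcal{L}$ be the unknown target language, and let $S_n$ denote $n$ i.i.d. training samples drawn from a distribution supported on $K$. Writing $\mathcal{G}_n$ for the set of strings the model can output after seeing $S_n$, one way to read the abstract's requirements is
\[
\text{consistency: } \Pr\big[\,\mathcal{G}_n \subseteq K \setminus S_n\,\big] \xrightarrow[n \to \infty]{} 1,
\qquad
\text{breadth: } \Pr\big[\,\mathcal{G}_n \supseteq K \setminus S_n\,\big] \xrightarrow[n \to \infty]{} 1,
\]
so that generation with consistency and breadth asks for $\mathcal{G}_n = K \setminus S_n$ with probability tending to one, whereas generation without breadth, in the sense of [KM24], drops the second condition.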