Language Generation in the Limit: Noise, Loss, and Feedback

Kleinberg and Mullainathan (2024) recently proposed a formal framework called language generation in the limit and showed that, given a sequence of example strings from an unknown target language drawn from any countable collection, an algorithm can correctly generate unseen strings from the target language within finite time. This notion was further refined by Li, Raman, and Tewari (2024), who defined the stricter categories of non-uniform and uniform generation. They showed that a finite union of uniformly generatable collections is generatable in the limit, and asked whether the same is true for non-uniform generation. We begin by resolving this question in the negative: we give a uniformly generatable collection and a non-uniformly generatable collection whose union is not generatable in the limit. We then use facets of this construction to further our understanding of several variants of language generation. The first two, generation with noise and generation without samples, were introduced by Raman and Raman (2025) and Li, Raman, and Tewari (2024) respectively. We show the equivalence of these models for uniform and non-uniform generation, and provide a characterization of non-uniform noisy generation. The former paper asked whether there is any separation between noisy and non-noisy generation in the limit; we show that such a separation exists even with a single noisy string. Finally, we study the framework of generation with feedback, introduced by Charikar and Pabbaraju (2025), where the algorithm is strengthened by allowing it to ask membership queries. We show that finitely many queries add no power, but infinitely many queries yield a strictly more powerful model. In summary, the results in this paper resolve the union-closedness of language generation in the limit, and leverage those techniques (and others) to give precise characterizations for natural variants that incorporate noise, loss, and feedback.
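
To make the basic setting concrete, the following is a minimal toy sketch, not the algorithm of Kleinberg and Mullainathan nor any construction from this paper. It assumes a hypothetical collection in which language k is the set of positive multiples of k, and strings are represented as positive integers. The names generate_unseen and play are illustrative only. The generator never identifies the target k; it only outputs an unseen multiple of the gcd of the examples seen so far, which must belong to the target language because k divides every example.

```python
from math import gcd
from typing import Iterable, List


def generate_unseen(samples: Iterable[int]) -> int:
    """Return a positive integer that is not among the samples but lies in
    every language of the toy collection that is consistent with them.

    Assumption (toy collection): language k = {k, 2k, 3k, ...}.
    """
    seen = set(samples)
    if not seen:
        raise ValueError("need at least one example string")
    # The gcd of the examples is a multiple of the (unknown) target k,
    # so every multiple of the gcd is in the target language.
    g = 0
    for s in seen:
        g = gcd(g, s)
    candidate = g
    while candidate in seen:  # skip strings that were already shown
        candidate += g
    return candidate


def play(enumeration: Iterable[int]) -> List[int]:
    """Feed the examples one at a time and record the generator's output
    after each new example."""
    prefix: List[int] = []
    outputs: List[int] = []
    for s in enumeration:
        prefix.append(s)
        outputs.append(generate_unseen(prefix))
    return outputs
```

For instance, with target language k = 4 and the examples 12, 24, 8, 4 presented in that order, every output of play([12, 24, 8, 4]) is a multiple of 4 that had not yet been shown. In this toy collection generation succeeds from the first example onward; the collections studied in the paper are far less benign.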
