Kleinberg and Mullainathan showed that language generation in the limit is always possible at the level of computability: given enough positive examples, a learner can eventually generate data indistinguishable from a target language. However, such existence results do not address feasibility. We study the sample complexity of language generation in the limit for several canonical classes of formal languages. Our results show that infeasibility already appears for context-free and regular languages, and persists even for strict subclasses such as locally threshold testable languages, as well as for incomparable classes such as non-erasing pattern languages, a well-studied class in the theory of language identification. Overall, our results establish a clear gap between the theoretical possibility of language generation in the limit and its computational feasibility.