Recent work on language identification and generation has established tight statistical rates at which these tasks can be achieved. This line of work typically operates under a strong realizability assumption: that the input data is drawn from an unknown distribution necessarily supported on some language in a given collection. In this work, we drop the realizability assumption entirely and impose no restrictions on the distribution of the input data. We propose objectives for studying both language identification and generation in this more general "agnostic" setup. For both problems, we obtain novel characterizations and nearly tight rates.