Language Generation and Identification From Partial Enumeration: Tight Density Bounds and Topological Characterizations

The success of large language models (LLMs) has motivated formal theories of language generation and learning. We study the framework of \emph{language generation in the limit}, where an adversary enumerates strings from an unknown language $K$ drawn from a countable class, and an algorithm must generate unseen strings from $K$. Prior work showed that generation is always possible, and that some algorithms achieve positive lower density, revealing a \emph{validity--breadth} trade-off between correctness and coverage. We resolve a main open question in this line of work, proving a tight bound of $1/2$ on the best achievable lower density. We then strengthen the model to allow \emph{partial enumeration}, where the adversary reveals only an infinite subset $C \subseteq K$. We show that generation in the limit remains achievable, and that if $C$ has lower density $\alpha$ in $K$, the algorithm's output achieves density at least $\alpha/2$, matching the upper bound. This generalizes the $1/2$ bound to the partial-information setting, where the generator must recover a set whose density is within a factor $1/2$ of the revealed subset's density. We further revisit the classical Gold--Angluin model of \emph{language identification} under partial enumeration. We characterize when identification in the limit is possible, in the sense that the hypotheses $M_t$ eventually satisfy $C \subseteq M_t \subseteq K$, and in the process give a new topological formulation of Angluin's characterization, showing that her condition is precisely equivalent to an appropriate topological space having the $T_D$ separation property.
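
The density statements above can be made concrete with one standard formalization. The notation here is an assumption introduced for illustration and is not fixed by the abstract: take an ordering $k_1, k_2, \dots$ of the strings of $K$, and for a subset $S \subseteq K$ define its lower density in $K$ as
\[
  \underline{d}(S, K) \;=\; \liminf_{n \to \infty} \frac{\lvert S \cap \{k_1, \dots, k_n\} \rvert}{n}.
\]
Writing $O$ for the set of strings the algorithm outputs (a hypothetical symbol), and assuming the output's density is also measured in $K$, the two density results would read as follows: under full enumeration the best achievable guarantee is $\underline{d}(O, K) \ge 1/2$, and under partial enumeration with $\underline{d}(C, K) = \alpha$ the guarantee becomes $\underline{d}(O, K) \ge \alpha/2$. Setting $C = K$ gives $\alpha = 1$ and recovers the $1/2$ bound as a special case.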
