Neural parameterization has significantly advanced unsupervised grammar induction. However, training these models with a traditional likelihood loss over all possible parses exacerbates two issues: 1) $\textit{structural optimization ambiguity}$, which arbitrarily selects one among structurally ambiguous optimal grammars despite the specific preference of gold parses, and 2) $\textit{structural simplicity bias}$, which leads a model to underutilize rules when composing parse trees. These challenges subject unsupervised neural grammar induction (UNGI) to inevitable prediction errors, high variance, and the need for extensive grammars to achieve accurate predictions. This paper tackles these issues, offering a comprehensive analysis of their origins. As a solution, we introduce $\textit{sentence-wise parse-focusing}$, which reduces the parse pool per sentence for loss evaluation by exploiting the structural bias of parsers pre-trained on the same dataset. On unsupervised parsing benchmarks, our method significantly improves performance while effectively reducing variance and the bias toward overly simplistic parses. Our research promotes learning more compact, accurate, and consistent explicit grammars, thereby facilitating better interpretability.
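To make the loss modification concrete, the display below is a minimal sketch contrasting the standard tree-marginal objective of neural grammar induction with a parse-focused variant; the symbols $\mathcal{T}(x)$ (the set of all parses of sentence $x$) and $\mathcal{F}(x)$ (a focused parse set selected with pre-trained parsers) are illustrative notation rather than definitions taken from the paper.
\[
\mathcal{L}_{\text{all}}(\theta) = -\log \sum_{t \in \mathcal{T}(x)} p_\theta(x, t)
\quad\longrightarrow\quad
\mathcal{L}_{\text{focus}}(\theta) = -\log \sum_{t \in \mathcal{F}(x)} p_\theta(x, t),
\qquad \mathcal{F}(x) \subseteq \mathcal{T}(x).
\]
Restricting the summation to $\mathcal{F}(x)$ is what the abstract refers to as reducing the parse pool per sentence during loss evaluation.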