The problem of identifying a probabilistic context-free grammar has two aspects: the first is determining the grammar's topology (the rules of the grammar), and the second is estimating the probabilistic weight of each rule. Given the hardness results for learning context-free grammars in general, and probabilistic grammars in particular, most of the literature has concentrated on the second problem. In this work we address the first problem. We restrict attention to structurally unambiguous weighted context-free grammars (SUWCFGs) and provide a query learning algorithm for structurally unambiguous probabilistic context-free grammars (SUPCFGs). We show that SUWCFGs can be represented using co-linear multiplicity tree automata (CMTAs), and provide a polynomial learning algorithm that learns CMTAs. We show that the learned CMTA can be converted into a probabilistic grammar, thus providing a complete algorithm for learning a structurally unambiguous probabilistic context-free grammar (both the grammar topology and the probabilistic weights) using structured membership queries and structured equivalence queries. A summarized version of this work was published at AAAI 21.
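To make the two aspects named in the abstract concrete, the following is a minimal sketch of a probabilistic context-free grammar in which the topology (which rules exist) and the weights (the probability of each rule) are stored together. The toy grammar and the names `RULES`, `is_normalized`, and `tree_probability` are hypothetical illustrations and are not taken from the paper; the sketch does not implement the learning algorithm or CMTAs.

```python
from math import isclose

# Hypothetical toy grammar (an assumption for illustration only).
# Keys give the topology: (left-hand side, right-hand side).
# Values give the probabilistic weight of each rule.
RULES = {
    ("S", ("S", "S")): 0.3,
    ("S", ("a",)): 0.7,
}

def is_normalized(rules):
    """Return True if the weights of each nonterminal's rules sum to 1."""
    totals = {}
    for (lhs, _rhs), weight in rules.items():
        totals[lhs] = totals.get(lhs, 0.0) + weight
    return all(isclose(total, 1.0) for total in totals.values())

def tree_probability(tree, rules):
    """Probability of a derivation tree: the product of the weights of the rules it uses.

    A tree is a pair (nonterminal, children); a leaf is a terminal string.
    """
    if isinstance(tree, str):
        return 1.0
    label, children = tree
    # The rule applied at this node is determined by the labels of its children.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    prob = rules[(label, rhs)]
    for child in children:
        prob *= tree_probability(child, rules)
    return prob

# Derivation tree for the string "a a": S -> S S, then each S -> a.
tree = ("S", [("S", ["a"]), ("S", ["a"])])
```

Here `tree_probability(tree, RULES)` multiplies one use of `S -> S S` by two uses of `S -> a`. Working on derivation trees rather than flat strings loosely mirrors the structured queries mentioned in the abstract, though the correspondence is only illustrative.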