The diversity and Zipfian frequency distribution of natural language predicates in corpora lead to sparsity in Entailment Graphs (EGs) built by Open Relation Extraction (ORE). EGs are computationally efficient and explainable models of natural language inference, but as symbolic models, they fail if a novel premise or hypothesis vertex is missing at test time. We present theory and methodology for overcoming such sparsity in symbolic models. First, we introduce a theory of optimal smoothing of EGs by constructing transitive chains. We then demonstrate an efficient, open-domain, and unsupervised smoothing method that uses an off-the-shelf Language Model to find approximations of missing premise predicates. This improves recall by 25.1 and 16.3 percentage points on two difficult directional entailment datasets, while raising average precision and maintaining model explainability. Further, in a QA task we show that EG smoothing is most useful for answering questions with less supporting text, where missing premise predicates are more costly. Finally, controlled experiments with WordNet confirm our theory and show that hypothesis smoothing is difficult, but possible in principle.
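The idea of premise smoothing can be illustrated with a minimal sketch. It is not the method evaluated in this work: the toy graph, the hand-made vectors standing in for Language Model embeddings, and the function names `nearest_premise` and `entailment_score` are all assumptions made for illustration. When a premise predicate is missing from the graph, the sketch substitutes the most similar predicate that is present and reads the entailment score from that vertex, returning the substitute so the inference stays inspectable.

```python
import math

# Toy entailment graph (assumed data): premise -> {hypothesis: score}.
TOY_EG = {
    "buy": {"own": 0.9, "pay for": 0.8},
    "visit": {"travel to": 0.85},
}

# Hand-made vectors standing in for Language Model predicate embeddings.
TOY_EMBEDDINGS = {
    "buy": (1.0, 0.1, 0.0),
    "visit": (0.0, 1.0, 0.2),
    "purchase": (0.9, 0.2, 0.0),  # not a vertex in TOY_EG
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_premise(predicate, graph, embeddings):
    """Return the graph vertex whose embedding is closest to the predicate."""
    query = embeddings[predicate]
    return max(graph, key=lambda p: cosine(query, embeddings[p]))


def entailment_score(premise, hypothesis, graph, embeddings):
    """Score premise -> hypothesis, smoothing the premise if it is missing.

    Returns (score, premise_vertex_used) so the substitution is visible.
    """
    if premise in graph:
        used = premise
    else:
        used = nearest_premise(premise, graph, embeddings)
    return graph[used].get(hypothesis, 0.0), used


score, used = entailment_score("purchase", "own", TOY_EG, TOY_EMBEDDINGS)
print(score, used)
```

Here the unseen premise "purchase" is approximated by the vertex "buy", so the query is answered through the chain purchase, buy, own rather than failing outright. A hypothesis missing from the graph still scores 0.0, which mirrors the abstract's point that hypothesis smoothing is the harder case.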