In this work we build upon negative results from an attempt at language modeling with predicted semantic structure, in order to establish empirical lower bounds on what could have made the attempt successful. More specifically, we design a concise binary vector representation of semantic structure at the lexical level and evaluate in depth how good an incremental tagger needs to be in order to achieve better-than-baseline performance with an end-to-end semantic-bootstrapping language model. We envision such a system as consisting of a (pretrained) sequential-neural component and a hierarchical-symbolic component working together to generate text with low surprisal and high linguistic interpretability. We find that (a) the dimensionality of the semantic vector representation can be dramatically reduced without losing its main advantages and (b) lower bounds on prediction quality cannot be established via a single score alone, but need to take the distributions of signal and noise into account.