A fundamental question in natural language processing is: what kind of language structure and semantics does a language model capture? Graph formats such as knowledge graphs are easy to evaluate because they express language semantics and structure explicitly. This study evaluates the semantics encoded in self-attention transformers by leveraging explicit knowledge graph structures. We propose novel metrics that measure the reconstruction error when graph path sequences from a knowledge graph are provided and we attempt to reconstruct the same sequences from the outputs of the self-attention transformer models. The opacity of language models has an immense bearing on societal issues of trust and explainable decision outcomes. Our findings suggest that language models are models of stochastic control processes for plausible language pattern generation. However, they do not ascribe object- and concept-level meaning and semantics, such as those described in knowledge graphs, to the learned stochastic patterns. Furthermore, to enable robust evaluation of concept understanding by language models, we construct and make public an augmented language understanding benchmark built on the General Language Understanding Evaluation (GLUE) benchmark. These findings have significant implications for application-level user trust, as stochastic patterns without a strong sense of meaning cannot be trusted in high-stakes applications.