In this paper, we make a contribution that can be understood from two perspectives. From an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely on the basis of token distinctions, and we show that GPT-4 and Llama 2 fail on it with a strong bias. We then create further challenging sub-tasks in an effort to explain this failure. From a Computational Linguistics perspective, we identify a group of constructions, involving three classes of adjectives, that cannot be distinguished by surface features. This enables us to probe LLMs' understanding of these constructions in various ways. We find that the models fail in a variety of ways to distinguish between the constructions, suggesting that they do not adequately represent the constructions' meaning or capture the lexical properties of phrasal heads.