Negation is a fundamental linguistic phenomenon that continues to challenge Large Language Models (LLMs), particularly in tasks requiring deep semantic understanding. Current benchmarks typically treat negation as a minor detail within broader tasks such as natural language inference, so benchmarks designed specifically to evaluate negation comprehension are scarce. In this work, we introduce Thunder-NUBench, a novel benchmark explicitly created to assess sentence-level understanding of negation in LLMs. Thunder-NUBench goes beyond surface-level cue detection by contrasting standard negation with structurally diverse alternatives, such as local negation, contradiction, and paraphrase. The benchmark comprises manually curated sentence-negation pairs and a multiple-choice dataset, enabling a comprehensive evaluation of how well models understand negation.
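To make the multiple-choice format concrete, below is a minimal sketch of what an evaluation item might look like. The field names, example sentences, and scoring function are illustrative assumptions for exposition, not the benchmark's actual schema.

```python
# Hypothetical sketch of a Thunder-NUBench-style multiple-choice item.
# All names and sentences here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class NegationItem:
    sentence: str            # the original affirmative sentence
    choices: dict[str, str]  # candidate rewrites, keyed by alternative type
    answer: str              # key of the correct sentence-level negation

item = NegationItem(
    sentence="The committee approved the proposal.",
    choices={
        "negation": "The committee did not approve the proposal.",
        "local_negation": "The committee approved not the proposal but the budget.",
        "contradiction": "The committee rejected the proposal.",
        "paraphrase": "The proposal was approved by the committee.",
    },
    answer="negation",
)

def accuracy(predictions: dict[str, str], items: list[NegationItem]) -> float:
    """Fraction of items where the model picked the true sentence-level negation."""
    correct = sum(predictions[it.sentence] == it.answer for it in items)
    return correct / len(items)

print(accuracy({item.sentence: "negation"}, [item]))  # 1.0
```

Contrasting the true negation with a local negation, a contradiction, and a paraphrase forces a model to reason about negation scope and meaning rather than keying on surface cues such as the word "not".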