Despite great performance on many tasks, language models (LMs) still struggle with reasoning, sometimes providing responses that cannot possibly be true because they stem from logical incoherence. We call such responses \textit{strong hallucinations} and prove that they follow from an LM's computation of its internal representations for logical operators and outputs from those representations. Focusing on negation, we provide a novel solution in which negation is treated not as another element of a latent representation, but as \textit{an operation over an LM's latent representations that constrains how they may evolve}. We show that our approach improves model performance in cloze prompting and natural language inference tasks with negation without requiring training on sparse negative data.