Despite great performance on many tasks, language models (LMs) still struggle with reasoning, sometimes producing responses that cannot possibly be true because they stem from logical incoherence. We call such responses \textit{strong hallucinations} and prove that they follow from the way an LM computes its internal representations of logical operators and the outputs it derives from those representations. Focusing on negation, we provide a novel solution in which negation is treated not as another element of a latent representation, but as \textit{an operation over an LM's latent representations that constrains how they may evolve}. We show that our approach improves model performance on cloze prompting and natural language inference tasks involving negation, without requiring training on sparse negative data.
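To make the idea of negation as an operation over representations concrete, the following minimal sketch (in Python, using the Hugging Face \texttt{transformers} library) scores a negated cloze prompt by querying the model with the corresponding positive prompt and then applying a toy negation operator to the resulting output distribution. The complement-renormalization rule and the choice of \texttt{bert-base-uncased} are illustrative assumptions, not the construction developed in the paper.

\begin{verbatim}
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative only: the model and the complement rule below are
# assumptions for this sketch, not the paper's actual operator.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm.eval()

def cloze_distribution(text):
    """Masked-token distribution for a prompt containing [MASK]."""
    inputs = tok(text, return_tensors="pt")
    mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = mlm(**inputs).logits[0, mask_pos]
    return logits.softmax(-1)

def negate(p_pos):
    """Toy negation operator: shift probability mass to the
    complement of the positive prediction, then renormalize."""
    p_neg = 1.0 - p_pos
    return p_neg / p_neg.sum()

# "A robin is not a ___": score candidates via the positive
# prompt, then apply the operator, rather than feeding the
# surface token "not" to the model as ordinary input.
p_pos = cloze_distribution(f"A robin is a {tok.mask_token}.")
p_neg = negate(p_pos)
print([tok.decode(int(i)) for i in p_neg.topk(5).indices])
\end{verbatim}

The point of the sketch is only the factoring it exhibits: the surface negation never enters the encoder; it acts afterwards, as a constraint on the distribution that the positive representation yields.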