Current evaluation methods for Attributed Question Answering (AQA) suffer from \textit{attribution myopia}: they emphasize verifying isolated statements and their attributions while overlooking the global logical integrity of long-form answers. Consequently, Large Language Models (LLMs) often produce factually grounded yet logically incoherent responses containing elusive deductive gaps. To mitigate this limitation, we present \textsc{LogicScore}, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Grounded in Horn rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: \textit{Completeness} (logically sound deduction), \textit{Conciseness} (non-redundancy), and \textit{Determinateness} (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MuSiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5, Gemini-3-Pro, LLaMA3, and task-specific tuned models) reveal a critical capability gap: leading models often achieve high attribution scores (e.g., 92.85\% precision for Gemini-3-Pro) yet struggle with global reasoning quality (e.g., 35.11\% Conciseness for Gemini-3-Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development. Code is available at: https://github.com/zhichaoyan11/LogicScore.
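To make the Horn-rule backward verification idea concrete, the sketch below shows generic backward chaining over Horn clauses: a goal atom is verified by recursively proving every atom in the body of a rule whose head matches it. This is an illustrative toy, not the paper's implementation; the atom encoding, the `backward_prove` function, and the example facts and rule are all hypothetical.

```python
def backward_prove(goal, facts, rules, depth=10):
    """Return True if `goal` is derivable from `facts` via Horn `rules`.

    facts: set of ground atoms (strings, for illustration).
    rules: list of (head, body) pairs, where `body` is a list of atoms;
           the Horn rule reads "head holds if every body atom holds".
    depth: recursion bound to avoid looping on cyclic rule sets.
    """
    if depth == 0:
        return False
    if goal in facts:
        return True
    for head, body in rules:
        # Backward step: to prove the head, prove each body atom in turn.
        if head == goal and all(
            backward_prove(atom, facts, rules, depth - 1) for atom in body
        ):
            return True
    return False


# Toy multi-hop example: nationality follows from birthplace plus location.
facts = {"born_in(A, CityX)", "city_in(CityX, CountryY)"}
rules = [
    ("nationality(A, CountryY)", ["born_in(A, CityX)", "city_in(CityX, CountryY)"]),
]

print(backward_prove("nationality(A, CountryY)", facts, rules))  # True
print(backward_prove("nationality(B, CountryY)", facts, rules))  # False
```

Under this view, an answer with a \textit{deductive gap} corresponds to a goal whose proof fails because some body atom is never stated, while redundancy (the \textit{Conciseness} dimension) corresponds to stated atoms that no successful proof ever uses.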