Current evaluation methods for Attributed Question Answering (AQA) suffer from \textit{attribution myopia}: they emphasize the verification of isolated statements and their attributions while overlooking the global logical integrity of long-form answers. Consequently, Large Language Models (LLMs) often produce factually grounded yet logically incoherent responses containing subtle deductive gaps. To mitigate this limitation, we present \textsc{LogicScore}, a unified evaluation framework that shifts the paradigm from local assessment to global reasoning scrutiny. Grounded in Horn rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: \textit{Completeness} (logically sound deduction), \textit{Conciseness} (non-redundancy), and \textit{Determinateness} (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MuSiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5, Gemini-3-Pro, LLaMA3, and task-specific fine-tuned models) reveal a critical capability gap: leading models often achieve high attribution scores (e.g., 92.85\% precision for Gemini-3-Pro) but struggle with global reasoning quality (e.g., 35.11\% Conciseness for Gemini-3-Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development. Code is available at: https://github.com/zhichaoyan11/LogicScore.