Multi-agent LLM systems for medical question answering often treat consensus as a reliability signal: if multiple agents agree on an answer, it is presumed trustworthy. However, answer-level consensus does not entail reasoning-level alignment. We introduce CARA (Cross-Agent Reasoning Alignment), a family of automated metrics that measure whether agents who agree on an answer also agree on the reasoning behind it. Applying CARA to a standard debate system on two medical QA benchmarks, MedQA-USMLE and MedThink-Bench, we identify the consistency illusion: a failure mode in which debate reduces detectable contradictions between agents while simultaneously decreasing the semantic similarity of their reasoning chains. Agents appear to agree more but reason less consistently. To reduce this misalignment, we propose the Grounded Debate Protocol (GDP), a prompt-level intervention that requires agents to commit to named medical facts and take explicit stances on other agents' claims. GDP produces large, consistent alignment improvements, with Cohen's d ranging from +1.43 to +1.99, across two datasets and two backbone models, without adding LLM calls or modifying system architecture. Our results motivate cross-agent reasoning alignment as a quantity to audit alongside accuracy in safety-critical domains.
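For readers unfamiliar with the effect-size measure cited above, the following is a minimal sketch of the textbook pooled-standard-deviation form of Cohen's d. It is an illustration only: the abstract does not state which variant is used (a paired form is equally plausible), the function name `cohens_d` and its parameters are hypothetical, and the numbers in the usage note are made up rather than taken from the reported results.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Assumption: independent-samples form. `treatment` and `control` are
    hypothetical lists of per-question alignment scores (for example, with
    and without an intervention). A positive value means the treatment
    scores are higher on average.
    """
    n1, n2 = len(treatment), len(control)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")

    # Pool the two unbiased sample variances, weighted by degrees of freedom.
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")

    # Standardized mean difference.
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)
```

As a toy check, `cohens_d([2, 4, 6], [1, 3, 5])` gives 0.5: the means differ by 1 and the pooled standard deviation is 2. By the usual rule of thumb, values above roughly 0.8 are considered large.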