Large language models (LLMs) remain unreliable for high-stakes claim verification because of hallucinations and shallow reasoning. Retrieval-augmented generation (RAG) and multi-agent debate (MAD) partially address these failures, but they are limited by one-pass retrieval and unstructured debate dynamics. We propose PROClaim, a courtroom-style multi-agent framework that reformulates verification as a structured, adversarial deliberation. Our approach integrates specialized roles (e.g., Plaintiff, Defense, Judge) with Progressive RAG (P-RAG), which dynamically expands and refines the evidence pool during the debate. We further employ evidence negotiation, self-reflection, and heterogeneous multi-judge aggregation to promote calibration, robustness, and diversity. In zero-shot evaluations on the Check-COVID benchmark, PROClaim achieves 81.7% accuracy, outperforming standard multi-agent debate by 10.0 percentage points (pp), with P-RAG contributing the largest share of the gain (+7.5 pp). These results indicate that structured deliberation and model heterogeneity mitigate systematic biases, providing a robust foundation for reliable claim verification. Our code and data are publicly available at https://github.com/mnc13/PROClaim.
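To make two of the ideas above concrete, the following is a minimal sketch of progressive retrieval and multi-judge aggregation. It is an illustration under stated assumptions, not the PROClaim implementation: the function names (`progressive_retrieve`, `aggregate_verdicts`), the `retriever` and `refine_query` callables, the fixed round budget, the verdict labels, and the use of a simple majority vote are all hypothetical choices made for this example. The released repository should be consulted for the actual procedure.

```python
from collections import Counter
from typing import Callable, List, Set

# Hypothetical interfaces, assumed for this sketch only.
Retriever = Callable[[str], List[str]]          # query -> retrieved passages
QueryRefiner = Callable[[str, Set[str]], str]   # (claim, evidence pool) -> next query


def progressive_retrieve(
    claim: str,
    retriever: Retriever,
    refine_query: QueryRefiner,
    rounds: int = 3,
) -> Set[str]:
    """Grow an evidence pool over several rounds instead of retrieving once."""
    pool: Set[str] = set()
    query = claim
    for _ in range(rounds):
        # Keep only passages that are not already in the pool.
        new_passages = [p for p in retriever(query) if p not in pool]
        if not new_passages:
            break  # nothing new was found, so stop early
        pool.update(new_passages)
        # Reformulate the query using what has been gathered so far.
        query = refine_query(claim, pool)
    return pool


def aggregate_verdicts(verdicts: List[str]) -> str:
    """Return the most frequent verdict among the judges (assumed majority vote).

    Ties resolve to the verdict that appears first in the input list.
    """
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return Counter(verdicts).most_common(1)[0][0]
```

In a debate setting, `refine_query` would presumably be driven by the arguments raised in the current round, and each entry in `verdicts` would come from a different underlying model, which is where judge heterogeneity would enter this sketch.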