Scientific research relies on citation integrity, yet large language models (LLMs) have introduced a critical risk: fabricated references that appear plausible but correspond to no real publication. As manual verification becomes infeasible and existing automated tools remain fragile, we introduce CiteAudit, a comprehensive benchmark and detection framework for hallucinated citations. We design a multi-agent verification pipeline that decomposes citation checking into four stages: metadata extraction, memory lookup, web-based retrieval, and final judgment. To evaluate the framework, we construct a large-scale, human-validated dataset spanning diverse domains and hallucination types. Experiments show that our framework outperforms state-of-the-art LLMs and commercial baselines on citation verification. Our work provides the infrastructure needed to audit citations at scale and safeguard the trustworthiness of scholarly discourse. Code is available at https://github.com/shiiiikw/CiteAudit.
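To make the four-stage decomposition concrete, the following is a minimal sketch of how such a pipeline could be wired together. It is not the CiteAudit implementation: the names `Citation`, `extract_metadata`, `memory_lookup`, `judge`, and `verify`, the `Authors (Year). Title. Venue.` reference format, the three verdict labels, and the plain-function stages (rather than LLM agents) are all assumptions made for illustration. The web-retrieval stage is passed in as a callable so the sketch runs without network access.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class Citation:
    """Hypothetical minimal record; a real system would track authors, venue, DOI, etc."""
    title: str
    year: Optional[int]


# Assumed reference format: "Authors (Year). Title. Venue."
_REF_PATTERN = re.compile(r"\((?P<year>\d{4})\)\.\s*(?P<title>[^.]+)\.")


def normalize(title: str) -> str:
    """Lowercase and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def extract_metadata(raw: str) -> Citation:
    """Stage 1: pull title and year out of a raw reference string."""
    match = _REF_PATTERN.search(raw)
    if match is None:
        # Unparseable input: keep the whole string as the title, year unknown.
        return Citation(title=raw.strip(), year=None)
    return Citation(title=match.group("title").strip(), year=int(match.group("year")))


def memory_lookup(citation: Citation, memory: Dict[str, Citation]) -> Optional[Citation]:
    """Stage 2: check a local index of known publications, keyed by normalized title."""
    return memory.get(normalize(citation.title))


def judge(citation: Citation, evidence: Optional[Citation]) -> str:
    """Stage 4: compare the claimed metadata against retrieved evidence."""
    if evidence is None:
        return "unverified"  # no matching record found anywhere
    if None not in (citation.year, evidence.year) and citation.year != evidence.year:
        return "metadata_mismatch"  # the work exists, but the cited year is wrong
    return "verified"


def verify(
    raw: str,
    memory: Dict[str, Citation],
    web_search: Callable[[Citation], Optional[Citation]],
) -> str:
    """Run the four stages in order; stage 3 (web retrieval) runs only on a memory miss."""
    citation = extract_metadata(raw)
    evidence = memory_lookup(citation, memory)
    if evidence is None:
        evidence = web_search(citation)
    return judge(citation, evidence)


# Toy index holding one real publication; the web stage is stubbed to find nothing.
known = Citation("Attention Is All You Need", 2017)
memory = {normalize(known.title): known}
no_web = lambda citation: None
```

With this setup, `verify("Vaswani et al. (2017). Attention Is All You Need. NeurIPS.", memory, no_web)` resolves from the local index, while a reference with an invented title falls through to the stubbed web stage and comes back `"unverified"`. Consulting memory before the web keeps the expensive retrieval step for citations the local index cannot resolve.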