GhostCite: A Large-Scale Analysis of Citation Validity in the Age of Large Language Models

Zuyao Xu, Yuqi Qiu, Lu Sun, FaSheng Miao, Fubin Wu, Xinyi Wang, Xiang Li, Haozhe Lu, ZhengZe Zhang, Yuxin Hu, Jialu Li, Jin Luo, Feng Zhang, Rui Luo, Xinran Liu, Yingxian Li, Jiaji Liu

Citations provide the basis for trusting scientific claims; when they are invalid or fabricated, this trust collapses. With the advent of Large Language Models (LLMs), this risk has intensified: LLMs are increasingly used for academic writing, yet their tendency to fabricate citations ("ghost citations") poses a systemic threat to citation validity. To quantify this threat and inform mitigation, we develop CiteVerifier, an open-source framework for large-scale citation verification, and conduct the first comprehensive study of citation validity in the LLM era through three experiments built on it. We benchmark 13 state-of-the-art LLMs on citation generation across 40 research domains, finding that all models hallucinate citations at rates from 14.23% to 94.93%, with significant variation across research domains. Moreover, we analyze 2.2 million citations from 56,381 papers published at top-tier AI/ML and Security venues (2020-2025), confirming that 1.07% of papers contain invalid or fabricated citations (604 papers), with an 80.9% increase in 2025 alone. Furthermore, we survey 97 researchers and analyze 94 valid responses after removing 3 conflicting samples, revealing a critical "verification gap": 41.5% of researchers copy-paste BibTeX without checking and 44.4% choose no-action responses when encountering suspicious references; meanwhile, 76.7% of reviewers do not thoroughly check references and 80.0% never suspect fake citations. Our findings reveal an accelerating crisis where unreliable AI tools, combined with inadequate human verification by researchers and insufficient peer review scrutiny, enable fabricated citations to contaminate the scientific record. We propose interventions for researchers, venues, and tool developers to protect citation integrity.
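As a quick sanity check on the headline figures in the abstract, the arithmetic can be reproduced in a few lines. This is only an illustrative sketch: the variable names and the `pct` helper are assumptions introduced here, not part of CiteVerifier, and the citation total is the rounded "2.2 million" from the abstract, so the per-paper average is approximate.

```python
# Figures quoted in the abstract (names below are illustrative, not from the paper).
papers_total = 56_381        # papers analyzed
papers_flagged = 604         # papers with invalid or fabricated citations
citations_total = 2_200_000  # "2.2 million" citations, rounded in the abstract

surveyed = 97                # researchers surveyed
removed = 3                  # conflicting samples removed


def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Share of papers containing at least one invalid or fabricated citation.
flagged_share = pct(papers_flagged, papers_total)

# Valid survey responses after removing conflicting samples.
valid_responses = surveyed - removed

# Approximate citations per paper (the citation total is a rounded figure).
avg_citations = round(citations_total / papers_total, 1)

print(flagged_share)    # 1.07
print(valid_responses)  # 94
print(avg_citations)
```

The first two values match the 1.07% and 94 responses reported in the abstract; the third gives a rough sense of scale, on the order of 39 citations per paper.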
