GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

Large language models (LLMs) are increasingly used in academic research workflows, but scholarly tasks require high factual precision and therefore expose a key weakness: overconfidence. Here, overconfidence is defined behaviorally as the tendency to produce confident, assertive, and well-formatted outputs even when the underlying knowledge is incomplete or unverifiable, rather than as a calibration gap between stated confidence and accuracy. To examine this issue, we introduce GIScholarBench, a benchmark built from 10,865 papers published in 25 core GIScience journals between 2020 and 2025. The benchmark covers three tasks of increasing cognitive complexity: metadata retrieval, literature linking, and research direction generation. We evaluate Claude Sonnet 4.5, Gemini 3, and ChatGPT 5.3 through their native web interfaces under real-world user-facing conditions. Results show consistent overconfidence across all tasks. In metadata retrieval, ChatGPT 5.3 achieves the highest accuracy, but all models still generate definitive titles and DOIs when their predictions are wrong. In literature linking, Claude Sonnet 4.5 recovers the most references, but all models show a clear gap between top-ranked retrieval and longer citation lists, suggesting that models extend reference lists beyond what they can reliably retrieve. In research direction generation, AI-generated directions show lower topic coverage, higher novel miss rates, and lower semantic diversity than real future-citing papers. These findings suggest that LLM overconfidence is task-invariant but takes different forms: factual overgeneration in retrieval, unreliable citation expansion in literature linking, and overconfidence in output completeness during research ideation.
