Bibliographic reference extraction and parsing are foundational for citation indexing, linking, and downstream scholarly knowledge-graph construction. However, most established evaluations focus on clean, English, end-of-document bibliographies, and therefore underrepresent the Social Sciences and Humanities (SSH), where citations are frequently multilingual, embedded in footnotes, abbreviated, and shaped by heterogeneous historical conventions. We present a unified benchmark that targets these SSH-realistic conditions across three complementary datasets: CEX (English journal articles spanning multiple disciplines), EXCITE (German/English documents with end-section, footnote-only, and mixed regimes), and LinkedBooks (humanities references with strong stylistic variation and multilinguality). We evaluate three tasks of increasing difficulty -- reference extraction, reference parsing, and end-to-end document parsing -- under a schema-constrained setup that enables direct comparison between a strong supervised pipeline baseline (GROBID) and contemporary LLMs (DeepSeek-V3.1, Mistral-Small-3.2-24B, Gemma-3-27B-it, and Qwen3-VL (4B-32B variants)). Across datasets, extraction largely saturates beyond a moderate capability threshold, while parsing and end-to-end parsing remain the primary bottlenecks due to structured-output brittleness under noisy layouts. We further show that lightweight LoRA adaptation yields consistent gains -- especially on SSH-heavy benchmarks -- and that segmentation/pipelining can substantially improve robustness. Finally, we argue for hybrid deployment via routing: leveraging GROBID for well-structured, in-distribution PDFs while escalating multilingual and footnote-heavy documents to task-adapted LLMs.