Citation extraction tools are designed for the structured end-of-document bibliographies of the natural sciences, but law and humanities scholarship cites references primarily in footnotes, where bibliographic data is interleaved with commentary and cross-references and varies widely across languages and styles. To address the scarcity of suitable gold-standard resources, we present FOSSIL (Footnote-based Open-access SSH Scientific Instance Labels), an openly licensed multilingual dataset of 96 annotated scholarly articles containing over 7,600 footnote-embedded references, together with PDF-TEI Editor (a collaborative web annotation tool), a documented seven-annotator workflow, and a Grobid specialization for footnote-based citations. In end-to-end evaluation, the specialized pipeline nearly doubles extraction quality over default Grobid (micro-F1 from 0.36 to 0.72), driven largely by improved recall, while showing that substantial headroom remains for cross-references and mixed-content footnotes. This extended abstract presents work in progress; annotation of citation segmentation and parsing, as well as cross-reference resolution, is ongoing.
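For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a micro-averaged F1 score can be computed from per-document match counts. It is not the evaluation code behind the reported figures: the `Counts` type, the `micro_prf` function, and the example numbers are hypothetical and serve only to illustrate that micro-averaging pools true positives, false positives, and false negatives across all documents before computing precision and recall.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical per-document match counts for extracted references."""
    tp: int  # extracted references that match a gold reference
    fp: int  # extracted references with no gold counterpart
    fn: int  # gold references that were not extracted


def micro_prf(per_doc):
    """Pool counts over all documents, then compute precision, recall, F1."""
    tp = sum(c.tp for c in per_doc)
    fp = sum(c.fp for c in per_doc)
    fn = sum(c.fn for c in per_doc)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative numbers only (not taken from the dataset): one document where
# most references are found and one where most are missed.
docs = [Counts(tp=8, fp=2, fn=2), Counts(tp=1, fp=0, fn=9)]
precision, recall, f1 = micro_prf(docs)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Because the counts are pooled, documents with many footnote references weigh more heavily than short ones, and a gain in recall on reference-dense articles translates directly into a higher micro-F1.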