Scientific concepts are often defined inconsistently across papers, making it difficult to compare findings, reuse terminology, and build reliable downstream resources. We present SciDef, a resource suite for scientific definition extraction. The suite contains DefExtra, a benchmark of 268 human-validated, author-stated definitions from 75 academic papers; DefSim, a set of 60 human-labeled definition-pair similarity judgments; and an open LLM-based pipeline for PDF preprocessing, chunking, definition extraction, prompt optimization, and evaluation. We validate the resources by benchmarking 16 language models across prompting strategies and chunking schemes. The strongest set-level configuration achieves a score of 0.397, while the highest-coverage configuration matches at least one prediction to 86.4% of gold definitions but over-generates candidate definitions. We further show that an NLI-based matching metric agrees strongly with human DefSim judgments. These results position SciDef as a reusable benchmark and tooling layer for definition-centric literature analysis, while highlighting relevance-aware filtering as the key bottleneck for fully automatic definition extraction. Code and datasets are available at https://github.com/Media-Bias-Group/SciDef.
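To make the gap between coverage and over-generation concrete, the following is a minimal sketch and not the SciDef implementation. Coverage is read here as the fraction of gold definitions matched by at least one prediction, as described in the abstract. The function names, the 0.5 threshold, and the token-overlap scorer are all illustrative assumptions; in particular, `token_jaccard` is only a toy stand-in for the NLI-based matcher the abstract mentions.

```python
from typing import Callable, List


def token_jaccard(a: str, b: str) -> float:
    """Toy similarity (assumption): Jaccard overlap of lowercased tokens.

    This is a placeholder for a real matcher such as an NLI model.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def coverage(
    gold: List[str],
    predicted: List[str],
    score: Callable[[str, str], float] = token_jaccard,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> float:
    """Fraction of gold definitions matched by at least one prediction."""
    if not gold:
        return 0.0
    hits = sum(
        1 for g in gold if any(score(g, p) >= threshold for p in predicted)
    )
    return hits / len(gold)


def matched_prediction_rate(
    gold: List[str],
    predicted: List[str],
    score: Callable[[str, str], float] = token_jaccard,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> float:
    """Fraction of predictions that match some gold definition.

    A low value alongside high coverage signals over-generation.
    """
    if not predicted:
        return 0.0
    hits = sum(
        1 for p in predicted if any(score(g, p) >= threshold for g in gold)
    )
    return hits / len(predicted)


gold_defs = ["a b c d", "x y z"]
pred_defs = ["a b c d", "a b c e", "q r s", "m n o"]
cov = coverage(gold_defs, pred_defs)
rate = matched_prediction_rate(gold_defs, pred_defs)
```

Under this reading, a system can raise coverage simply by emitting more candidates, which lowers the share of predictions that correspond to any gold definition. That trade-off is why the abstract singles out relevance-aware filtering as the bottleneck.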