Existing scientific relation extraction benchmarks mainly target domains such as computer science, where entities are tasks, methods, datasets, materials, or metrics. This leaves a gap in variable-oriented empirical fields such as psychology, where findings are expressed as relations among constructs, measurements, interventions, and outcomes. We introduce variable-centered empirical graph extraction, the task of mapping scientific abstracts to typed graphs whose nodes are normalized variables and whose edges represent empirical and hierarchical relations. To support this task, we construct EmpiriGraph-Psy, a benchmark of 210 psychology abstracts annotated by domain-trained annotators with normalized variables, concept hierarchies, empirical relation types, and validation states. We evaluate frontier and open-weight LLMs using both direct extraction and a staged graph-construction pipeline that separates variable extraction, normalization, hierarchy construction, evidence selection, relation extraction, and edge validation. The staged pipeline substantially outperforms direct extraction, with the best configuration achieving a macro-F1 of 0.74. Error analysis shows that moderation relations and concept hierarchies remain the most challenging cases, highlighting the difficulty of extracting higher-order empirical claims and implicit abstraction structure from scientific abstracts.
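To make the target representation concrete, the following is a minimal sketch of a typed edge between normalized variables, together with one plausible way to compute macro-F1 over relation types. It is an illustration under stated assumptions, not the benchmark's actual schema or scoring script: the `Edge` class, the relation labels (`negative_association`, `moderation`, `is_a`), the example variables, and the exact-match criterion are all hypothetical, and the abstract does not specify how macro-F1 is computed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A typed edge between two normalized variables (hypothetical schema)."""
    source: str    # normalized variable name
    target: str    # normalized variable name
    relation: str  # relation type label; the label set here is an assumption


def macro_f1(gold: set[Edge], pred: set[Edge]) -> float:
    """Average per-relation-type F1, counting only exact edge matches.

    Assumption: a predicted edge is correct only if source, target, and
    relation all match a gold edge. Types are taken from gold and pred.
    """
    labels = {e.relation for e in gold | pred}
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        g = {e for e in gold if e.relation == label}
        p = {e for e in pred if e.relation == label}
        tp = len(g & p)
        precision = tp / len(p) if p else 0.0
        recall = tp / len(g) if g else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented example graph for a single abstract.
gold = {
    Edge("stress", "sleep quality", "negative_association"),
    Edge("social support", "sleep quality", "moderation"),
    Edge("insomnia", "sleep problem", "is_a"),
}
# The prediction misses the moderation edge, mirroring the error pattern
# the abstract reports as most challenging.
pred = {
    Edge("stress", "sleep quality", "negative_association"),
    Edge("insomnia", "sleep problem", "is_a"),
}

score = macro_f1(gold, pred)
```

Because the score is averaged over relation types rather than over edges, a type that is rare but consistently missed, such as moderation in this toy example, lowers the result as much as a frequent one would.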