HGNet: Scalable Foundation Model for Automated Knowledge Graph Generation from Scientific Literature

Automated knowledge graph (KG) construction is essential for navigating the rapidly expanding body of scientific literature. However, existing approaches struggle to recognize long multi-word entities, often fail to generalize across domains, and typically overlook the hierarchical nature of scientific knowledge. While general-purpose large language models (LLMs) offer adaptability, they are computationally expensive and yield inconsistent accuracy on specialized tasks. As a result, current KGs are shallow and inconsistent, limiting their utility for exploration and synthesis. We propose a two-stage framework for scalable, zero-shot scientific KG construction. The first stage, Z-NERD, introduces (i) Orthogonal Semantic Decomposition (OSD), which promotes domain-agnostic entity recognition by isolating semantic "turns" in text, and (ii) a Multi-Scale TCQK attention mechanism that captures coherent multi-word entities through n-gram-aware attention heads. The second stage, HGNet, performs relation extraction with hierarchy-aware message passing, explicitly modeling parent, child, and peer relations. To enforce global consistency, we introduce two complementary objectives: a Differentiable Hierarchy Loss to discourage cycles and shortcut edges, and a Continuum Abstraction Field (CAF) Loss that embeds abstraction levels along a learnable axis in Euclidean space. This is the first approach to formalize hierarchical abstraction as a continuous property within standard Euclidean embeddings, offering a simpler alternative to hyperbolic methods. We release SPHERE (https://github.com/basiralab/SPHERE), a multi-domain benchmark for hierarchical relation extraction. Our framework establishes a new state of the art on SciERC, SciER, and SPHERE, improving NER by 8.08% and RE by 5.99% on out-of-distribution tests. In zero-shot settings, gains reach 10.76% for NER and 26.2% for RE.
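To make the idea of placing abstraction levels along an axis in Euclidean space more concrete, the following is a minimal sketch, not the paper's CAF formulation. It assumes that an entity's abstraction level can be read off as the projection of its embedding onto a direction vector, and that parents should sit at least a margin above their children along that direction. The function names, the hinge form of the penalty, and the toy data are all hypothetical; in a trained model the axis would be a learnable parameter rather than a fixed list.

```python
import math

def abstraction_scores(embeddings, axis):
    """Project each embedding onto the unit-normalised axis (illustrative only)."""
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    return [sum(e_i * u_i for e_i, u_i in zip(e, unit)) for e in embeddings]

def axis_margin_loss(embeddings, axis, parent_child_pairs, margin=1.0):
    """Mean hinge penalty over (parent, child) index pairs.

    A pair is penalised when the parent's score along the axis does not
    exceed the child's score by at least `margin`. This hinge form is an
    assumption made for illustration, not the loss defined in the paper.
    """
    s = abstraction_scores(embeddings, axis)
    losses = [max(0.0, margin - (s[p] - s[c])) for p, c in parent_child_pairs]
    return sum(losses) / len(losses)

# Toy 2-D embeddings: index 0 is the most abstract concept, index 2 the least.
embeddings = [[2.0, 0.0], [1.0, 1.0], [0.0, -1.0]]
axis = [1.0, 0.0]  # fixed here; learnable in a real model

loss_ok = axis_margin_loss(embeddings, axis, [(0, 1), (1, 2)])  # ordering respected
loss_bad = axis_margin_loss(embeddings, axis, [(2, 0)])         # ordering inverted
print(loss_ok, loss_bad)
```

In this toy setup, pairs whose parent lies further along the axis than the child incur no penalty, while an inverted pair is penalised in proportion to how far it violates the ordering.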
