Knowledge graphs (KGs) have emerged as a prominent data representation and management paradigm. Usually underpinned by a schema (e.g., an ontology), KGs capture not only factual information but also contextual knowledge. In some tasks, a few KGs have established themselves as standard benchmarks. However, recent work points out that relying on a limited collection of datasets is not sufficient to assess the generalization capability of an approach. In data-sensitive fields such as education or medicine, access to public datasets is even more limited. To remedy these issues, we release PyGraft, a Python-based tool that generates highly customized, domain-agnostic schemas and KGs. The synthesized schemas encompass various RDFS and OWL constructs, while the synthesized KGs emulate the characteristics and scale of real-world KGs. The logical consistency of the generated resources is ultimately ensured by running a description logic (DL) reasoner. By generating both a schema and a KG in a single pipeline, PyGraft aims to enable the generation of a more diverse array of KGs for benchmarking novel approaches in areas such as graph-based machine learning (ML) or, more generally, KG processing. In graph-based ML in particular, this should foster a more holistic evaluation of model performance and generalization capability, going beyond the limited collection of available benchmarks. PyGraft is available at: https://github.com/nicolas-hbt/pygraft.