Recent years have witnessed significant advances in graph machine learning (GML), with applications spanning numerous domains. However, GML research has focused predominantly on developing powerful models, often overlooking a crucial initial step: constructing suitable graphs from common data formats such as tabular data. This construction process is fundamental to applying graph-based models, yet it remains largely understudied and lacks formalization. Our research aims to address this gap by formalizing the graph construction problem and proposing an effective solution. We identify two critical challenges in achieving this goal: (1) there are no dedicated datasets for formalizing and evaluating graph construction methods, and (2) existing automatic construction methods apply only to specific cases, so tedious human engineering is still required to generate high-quality graphs. To tackle these challenges, we make two contributions. First, we introduce a set of datasets to formalize and evaluate graph construction methods. Second, we propose AutoG, an LLM-based solution that automatically generates high-quality graph schemas without human intervention. Experimental results demonstrate that the quality of the constructed graph is critical to downstream task performance, and that AutoG can generate high-quality graphs that rival those produced by human experts. Our code is available at https://github.com/amazon-science/Automatic-Table-to-Graph-Generation.