Semantic types describe data more expressively than atomic types such as strings or integers. They link columns to real-world concepts, providing fine-grained information that is useful for tasks such as automated data cleaning, schema matching, and data discovery. Existing deep learning models trained on large text corpora have been successful at single-column semantic type prediction for relational data. In this work, we extend the semantic type prediction problem to JSON data, in which types are labeled based on JSON Paths. JSON Path is a query language for navigating complex JSON data structures by specifying the location and content of elements; a path plays a role similar to that of a column in relational data. We use a graph neural network to capture the structural information within collections of JSON documents. Our model outperforms an existing state-of-the-art model in several cases. These results demonstrate the ability of our model to understand complex JSON data and its potential for JSON-related data processing tasks.
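To make the analogy between JSON Paths and relational columns concrete, the following is a minimal sketch of how values in a collection of JSON documents could be grouped by path, so that each path behaves like a column to which a semantic type might be assigned. It is an illustration only and not the method described in this work: the function names `iter_paths` and `group_by_path`, the sample documents, and the choice to collapse array indices to `[*]` are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, List, Tuple


def iter_paths(node: Any, path: str = "$") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) pairs for every leaf of a parsed JSON value.

    Hypothetical helper: paths use a simplified dot notation and assume
    object keys contain no dots or brackets.
    """
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_paths(child, f"{path}.{key}")
    elif isinstance(node, list):
        for child in node:
            # Assumption: collapse array indices to a wildcard so that all
            # elements of an array share a single path.
            yield from iter_paths(child, f"{path}[*]")
    else:
        yield path, node


def group_by_path(documents: List[Any]) -> Dict[str, List[Any]]:
    """Collect leaf values across documents, keyed by their path."""
    grouped: Dict[str, List[Any]] = defaultdict(list)
    for doc in documents:
        for path, value in iter_paths(doc):
            grouped[path].append(value)
    return dict(grouped)


# Made-up sample documents, for illustration only.
docs = [
    {"name": "Ada", "address": {"city": "London"}, "tags": ["math", "poetry"]},
    {"name": "Alan", "address": {"city": "Manchester"}, "tags": ["logic"]},
]

columns = group_by_path(docs)
for path, values in columns.items():
    print(path, values)
```

Each key of `columns` then plays the role of a column name and each value list the role of that column's cells. Unlike a flat table, the paths also carry nesting information (for example, `$.address.city` sits under `$.address`), which is the kind of structure a graph-based model could exploit.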