Semi-structured data formats such as JSON have proved to be useful data models for applications that require flexibility in the format of stored data. However, JSON data often come without the schemas that are typically available with relational data. This has resulted in a number of tools for discovering schemas from a collection of data. Although such tools can be useful, existing approaches focus on the syntax of documents and ignore semantic information. In this work, we explore the automatic addition of meaningful semantic information to discovered schemas, similar to the information added by human schema authors. We leverage large language models and a corpus of manually authored JSON Schema documents to generate natural language descriptions of schema elements, assign meaningful names to reusable definitions, and identify which discovered properties are most useful and which can be considered "noise". Our approach performs well on existing text generation metrics that have previously been shown to correlate well with human judgement.