Taxonomy inference for tabular data is a critical task in schema inference: it aims to discover the entity types (i.e., concepts) of tables and to build their hierarchy. It can play an important role in data management, data exploration, ontology learning, and many other data-centric applications. Existing schema inference systems focus mainly on XML, JSON or RDF data, and often rely on the lexical formats and structures of the data to calculate similarities, with limited exploitation of the semantics of the text across a table. Motivated by recent work on taxonomy completion and construction using Large Language Models (LLMs), this paper presents two LLM-based methods for taxonomy inference for tables: (i) EmTT, which embeds columns using encoder-only LLMs such as BERT fine-tuned with contrastive learning, and utilises clustering for hierarchy construction; and (ii) GeTT, which generates table entity types and their hierarchy by iteratively prompting a decoder-only LLM such as GPT-4. Extensive evaluation on three real-world datasets, with six metrics covering different aspects of the output taxonomies, demonstrates that both EmTT and GeTT can produce taxonomies that are strongly consistent with the ground truth.
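To make the embed-then-cluster idea behind EmTT more concrete, the following is a minimal sketch of building a hierarchy from table embeddings, and not the paper's implementation. It rests on several assumptions that do not come from the abstract: the column vectors are hand-written toy values standing in for the output of a fine-tuned encoder, a table is represented by the mean of its column vectors, and the hierarchy is built by average-linkage agglomerative clustering on cosine similarity. The names `table_embedding` and `build_hierarchy` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def table_embedding(column_vectors):
    """Assumption: represent a table by the mean of its column vectors."""
    n = len(column_vectors)
    return [sum(dim) / n for dim in zip(*column_vectors)]


def build_hierarchy(table_vectors):
    """Average-linkage agglomerative clustering over table vectors.

    Returns a nested tuple in which each pair is one merge step,
    so the nesting depth reflects the order of merging.
    """
    # Each cluster is (tree, list of member table names).
    clusters = [(name, [name]) for name in sorted(table_vectors)]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sims = [
                cosine(table_vectors[a], table_vectors[b])
                for a in clusters[i][1]
                for b in clusters[j][1]
            ]
            score = sum(sims) / len(sims)
            if best is None or score > best[0]:
                best = (score, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters = rest + [merged]
    return clusters[0][0]


# Toy column embeddings (hypothetical values, not real encoder output).
tables = {
    "cities": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "countries": [[0.85, 0.15, 0.05], [0.9, 0.05, 0.1]],
    "films": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
table_vectors = {name: table_embedding(cols) for name, cols in tables.items()}
tree = build_hierarchy(table_vectors)
print(tree)
```

In this toy input the two location-like tables are merged first and the film table joins at the top level, which is the kind of grouping a taxonomy over table types would need. A realistic pipeline would replace the toy vectors with encoder outputs and would still need a step that assigns names to the inner nodes, which this sketch omits.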