Traditional dataset retrieval systems index metadata rather than the data values themselves. They therefore rely primarily on manual annotations and high-quality metadata, which are known to be labour-intensive to produce and challenging to automate. We propose a method to support metadata enrichment with topic annotations of column headers using three Large Language Models (LLMs): ChatGPT-3.5, GoogleBard and GoogleGemini. We investigate the LLMs' ability to classify column headers based on domain-specific topics from a controlled vocabulary. We evaluate our approach by assessing the internal consistency of the LLMs, the inter-machine alignment, and the human-machine agreement for the topic classification task. Additionally, we investigate the impact of contextual information (i.e. dataset description) on the classification outcomes. Our results suggest that ChatGPT and GoogleGemini outperform GoogleBard in internal consistency as well as LLM-human alignment. Interestingly, we found that context had no impact on the LLMs' performance. This work proposes a novel approach that leverages LLMs for text classification using a controlled topic vocabulary, which has the potential to facilitate automated metadata enrichment, thereby enhancing dataset retrieval and the Findability, Accessibility, Interoperability and Reusability (FAIR) of research data on the Web.
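The abstract does not say which statistics are used to measure internal consistency, inter-machine alignment, or human-machine agreement. As a minimal sketch, assuming each annotator (an LLM run or a human) assigns exactly one topic per column header, agreement between two annotators could be quantified with raw percent agreement and Cohen's kappa. The function names and the example topic labels below are hypothetical and do not come from the paper.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of column headers on which two annotators chose the same topic."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (Cohen's kappa)."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if both annotators labelled independently,
    # each following their own topic frequencies.
    expected = sum(counts_a[t] * counts_b[t] for t in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical topic throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical topic labels for four column headers (illustration only).
llm_run = ["health", "health", "income", "income"]
human = ["health", "health", "income", "health"]

agreement = percent_agreement(llm_run, human)
kappa = cohen_kappa(llm_run, human)
```

The same two functions could compare two runs of one model (internal consistency), two different models (inter-machine alignment), or a model against a human annotator (human-machine agreement), since all three reduce to comparing two label lists over the same headers.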