In support of open and reproducible research, a rapidly growing number of datasets are being made available. As more datasets become available, quality metadata becomes more important for discovering and reusing them. Yet datasets often lack quality metadata because resources for data curation are limited. Meanwhile, technologies such as artificial intelligence and large language models (LLMs) are progressing rapidly, and systems built on them, such as ChatGPT, have recently demonstrated promising capabilities for certain data curation tasks. This paper proposes leveraging LLMs for cost-effective annotation of subject metadata through in-context learning. Our method employs GPT-3.5 with prompts designed for annotating subject metadata and demonstrates promising performance in automatic metadata annotation. However, models based on in-context learning cannot acquire discipline-specific rules, which results in lower performance in several categories. This limitation arises from the limited contextual information available for subject inference. To the best of our knowledge, this is the first in-context learning method that harnesses LLMs for automated subject metadata annotation.
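The in-context learning setup described in the abstract can be illustrated with a minimal sketch of how a few-shot prompt for subject annotation might be assembled. This is not the authors' prompt: the category list, the `Example` record, and the helper names `build_prompt` and `parse_subject` are all assumptions made for illustration, and the call to GPT-3.5 itself is deliberately left out.

```python
from dataclasses import dataclass

# Hypothetical subject vocabulary; the paper's actual categories are not given here.
SUBJECTS = ["Earth sciences", "Health sciences", "Social sciences"]


@dataclass
class Example:
    """One labeled dataset record used as an in-context demonstration."""
    title: str
    description: str
    subject: str


def build_prompt(examples, title, description, subjects=SUBJECTS):
    """Assemble a few-shot prompt: instruction, labeled examples, then the query."""
    lines = [
        "Assign exactly one subject category to the dataset.",
        "Allowed categories: " + "; ".join(subjects),
        "",
    ]
    # Each demonstration shows the model the expected input/output format.
    for ex in examples:
        lines += [
            f"Title: {ex.title}",
            f"Description: {ex.description}",
            f"Subject: {ex.subject}",
            "",
        ]
    # The query record ends with an open "Subject:" slot for the model to fill.
    lines += [f"Title: {title}", f"Description: {description}", "Subject:"]
    return "\n".join(lines)


def parse_subject(reply, subjects=SUBJECTS):
    """Map a model reply onto the allowed vocabulary, or None if it does not match."""
    cleaned = reply.strip().rstrip(".").lower()
    for subject in subjects:
        if subject.lower() == cleaned:
            return subject
    return None
```

The resulting string would be sent as the user message to a chat model. Restricting the reply to a fixed vocabulary and rejecting anything outside it is one simple way to keep the annotations consistent. It does not address the limitation the abstract notes, namely that the model only sees the context supplied in the prompt.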