Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs), specifically GPT-4, to improve adherence to metadata standards. We conducted experiments on 200 randomly selected data records describing human samples relating to lung cancer from the NCBI BioSample repository, evaluating GPT-4's ability to suggest edits that bring records into adherence with metadata standards. We computed the adherence accuracy of field name-field value pairs through a peer review process, and we observed a marginal average improvement in adherence to the standard data dictionary from 79% to 80% (p<0.5). We then prompted GPT-4 with domain information in the form of the textual descriptions of CEDAR templates and recorded a significant improvement from 79% to 97% (p<0.01). These results indicate that, while LLMs may not be able to correct legacy metadata to ensure satisfactory adherence to standards when unaided, they do show promise for use in automated metadata curation when integrated with a structured knowledge base.
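To make the adherence metric concrete, the following is a minimal sketch of how the accuracy of field name-field value pairs could be scored for a single record. It is an illustration under stated assumptions, not the procedure used in the study, where adherence was judged through peer review. The function `adherence_accuracy`, the toy `DATA_DICTIONARY`, its field names, and its permitted values are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical data dictionary: each standard field name maps to a
# validator that decides whether a value is acceptable for that field.
DATA_DICTIONARY: Dict[str, Callable[[str], bool]] = {
    "tissue": lambda v: v in {"lung"},
    "sex": lambda v: v in {"male", "female"},
    "age": lambda v: v.isdigit(),
    "disease": lambda v: v in {"lung cancer"},
}


def adherence_accuracy(
    record: Dict[str, str],
    dictionary: Dict[str, Callable[[str], bool]],
) -> float:
    """Return the fraction of field name-field value pairs that adhere.

    A pair adheres only if the field name is in the dictionary AND the
    value passes that field's validator. An empty record scores 0.0.
    """
    if not record:
        return 0.0
    adherent = sum(
        1
        for name, value in record.items()
        if name in dictionary and dictionary[name](value)
    )
    return adherent / len(record)


# Hypothetical legacy record: one non-standard value and one
# non-standard field name.
legacy_record = {
    "tissue": "lung",
    "sex": "M",
    "age": "67",
    "disease_name": "lung cancer",
}

# Hypothetical corrected record, as a curator or an LLM might suggest.
corrected_record = {
    "tissue": "lung",
    "sex": "male",
    "age": "67",
    "disease": "lung cancer",
}

legacy_score = adherence_accuracy(legacy_record, DATA_DICTIONARY)
corrected_score = adherence_accuracy(corrected_record, DATA_DICTIONARY)
```

Averaging such per-record scores over a sample, before and after the suggested edits, would give figures comparable in form to the percentages reported above.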