Metadata play a crucial role in ensuring the findability, accessibility, interoperability, and reusability of datasets. This paper investigates the potential of large language models (LLMs), specifically GPT-4, to improve adherence to metadata standards. We conducted experiments on 200 randomly selected data records describing human samples relating to lung cancer from the NCBI BioSample repository, evaluating GPT-4's ability to suggest edits that bring records into adherence with metadata standards. We computed the adherence accuracy of field name-field value pairs through a peer review process, and we observed a marginal average improvement in adherence to the standard data dictionary from 79% to 80% (p<0.01). We then prompted GPT-4 with domain information in the form of the textual descriptions of CEDAR templates and recorded a significant improvement from 79% to 97% (p<0.01). These results indicate that, while LLMs may not be able to correct legacy metadata to ensure satisfactory adherence to standards when unaided, they do show promise for use in automated metadata curation when integrated with a structured knowledge base.
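As an illustration of how an average adherence accuracy over field name-field value pairs might be tallied, the following is a minimal sketch. It assumes each record has already been judged by reviewers, with one boolean per field indicating whether its value adheres to the standard. The function names, field names, and judgments are hypothetical and are not taken from the study.

```python
from statistics import mean


def record_adherence(judgments):
    """Return the fraction of field name-field value pairs judged adherent.

    `judgments` maps a field name to True (adherent) or False (not adherent).
    This layout is an assumption made for illustration only.
    """
    if not judgments:
        raise ValueError("record has no judged field name-field value pairs")
    return sum(1 for adherent in judgments.values() if adherent) / len(judgments)


def mean_adherence(records):
    """Average the per-record adherence accuracy over all records."""
    return mean(record_adherence(record) for record in records)


# Hypothetical reviewer judgments for two records, before and after edits.
original = [
    {"tissue": True, "sex": False, "age": True, "disease": True},
    {"tissue": False, "sex": True, "age": True, "disease": True},
]
edited = [
    {"tissue": True, "sex": True, "age": True, "disease": True},
    {"tissue": True, "sex": True, "age": True, "disease": False},
]

print(mean_adherence(original))  # 0.75
print(mean_adherence(edited))    # 0.875
```

Averaging per record, rather than pooling all pairs, gives each record equal weight regardless of how many fields it contains; whether the study weighted records this way is not stated here.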