Recent regulatory initiatives such as the European AI Act, along with relevant voices in the Machine Learning (ML) community, stress the need to describe datasets along several key dimensions for trustworthy AI, such as provenance processes and social concerns. However, this information is typically presented as unstructured text in accompanying documentation, hampering its automated analysis and processing. In this work, we explore using large language models (LLMs) and a set of prompting strategies to automatically extract these dimensions from documents and enrich the dataset description with them. Our approach could help data publishers and practitioners create machine-readable documentation to improve the discoverability of their datasets, assess their compliance with current AI regulations, and improve the overall quality of ML models trained on them. We evaluate the approach on 12 scientific dataset papers published in two scientific journals (Nature's Scientific Data and Elsevier's Data in Brief) using two different LLMs (GPT3.5 and Flan-UL2). Results show good accuracy with our prompt extraction strategies. Concrete results vary depending on the dimension, but overall, GPT3.5 shows slightly better accuracy (81.21%) than Flan-UL2 (69.13%), although it is more prone to hallucinations. We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results, in an open-source repository.
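As a rough illustration of the extraction idea described in the abstract, the following minimal sketch asks one question per dimension and collects the answers into a machine-readable record. It is not the released tool: the dimension names, the question wording, the `NOT_FOUND` sentinel, and the `fake_llm` stand-in are all assumptions made for this example, and a real run would replace `fake_llm` with a call to an actual model such as GPT3.5 or Flan-UL2.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical dimensions and questions, chosen only for illustration.
DIMENSION_QUESTIONS: Dict[str, str] = {
    "provenance": "How was the data collected, and by whom?",
    "social_concerns": "Does the dataset raise privacy, bias, or other social concerns?",
}

# Assumed sentinel the model is told to return when the document is silent,
# so that missing information is not filled in by a hallucinated answer.
NOT_FOUND = "NOT_FOUND"


def build_prompt(question: str, document_text: str) -> str:
    """Build a single-question prompt grounded in the document text."""
    return (
        "Answer the question using only the document below. "
        f"If the document does not contain the answer, reply exactly {NOT_FOUND}.\n\n"
        f"Document:\n{document_text}\n\n"
        f"Question: {question}\nAnswer:"
    )


def extract_dimensions(
    document_text: str, llm: Callable[[str], str]
) -> Dict[str, Optional[str]]:
    """Query the model once per dimension; map the sentinel to None."""
    result: Dict[str, Optional[str]] = {}
    for name, question in DIMENSION_QUESTIONS.items():
        answer = llm(build_prompt(question, document_text)).strip()
        result[name] = None if answer == NOT_FOUND else answer
    return result


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; answers only the provenance question."""
    question = prompt.split("Question:")[-1]
    if "collected" in question:
        return "Collected by field survey."
    return NOT_FOUND


if __name__ == "__main__":
    record = extract_dimensions("Example dataset paper text.", fake_llm)
    print(json.dumps(record, indent=2))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client, and mapping the sentinel to `None` makes it explicit in the output which dimensions the document did not cover.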