Governments typically collect and steward vast amounts of high-quality data on their citizens and institutions, and the UK government is exploring how it can better publish and provision this data for the benefit of the AI landscape. However, the compositions of generative AI training corpora remain closely guarded secrets, making the planning of data-sharing initiatives difficult. To address this, we devise two methods to assess UK government data usage in the training of Large Language Models (LLMs) and 'peek behind the curtain' to observe the UK government's current contributions as a data provider for AI. The first method, an ablation study that utilises LLM 'unlearning', examines the importance of the information held on UK government websites to LLMs and their performance on citizen query tasks. The second method, an information leakage study, ascertains whether LLMs are aware of the information held in the datasets published on the UK government's open data initiative, data.gov.uk. Our findings indicate that UK government websites are important data sources for AI (heterogeneously so across subject matters), while data.gov.uk is not. This paper serves as a technical report, explaining in depth the designs, mechanics, and limitations of the above experiments. It is accompanied by a complementary non-technical report on the ODI website, in which we summarise the experiments and key findings, interpret them, and build a set of actionable recommendations for the UK government to take forward as it designs AI policy. While we focus on UK open government data, we believe the methods introduced in this paper present a reproducible approach to tackling the opaqueness of AI training corpora and provide organisations with a framework to evaluate and maximise their contributions to AI development.