Governments typically collect and steward vast amounts of high-quality data on their citizens and institutions, and the UK government is exploring how it can better publish and provision this data for the benefit of the AI landscape. However, the compositions of generative AI training corpora remain closely guarded secrets, making the planning of data-sharing initiatives difficult. To address this, we devise two methods to assess UK government data usage in the training of Large Language Models (LLMs) and 'peek behind the curtain' to observe the UK government's current contributions as a data provider for AI. The first method, an ablation study that utilises LLM 'unlearning', examines the importance of the information held on UK government websites to LLMs and their performance on citizen query tasks. The second method, an information leakage study, ascertains whether LLMs are aware of the information held in the datasets published on the UK government's open data initiative, data.gov.uk. Our findings indicate that UK government websites are important data sources for AI (heterogeneously so across subject matters), while data.gov.uk is not. This paper serves as a technical report, explaining in depth the designs, mechanics, and limitations of the above experiments. It is accompanied by a complementary non-technical report on the ODI website, in which we summarise the experiments and key findings, interpret them, and build a set of actionable recommendations for the UK government to take forward as it designs AI policy. While we focus on UK open government data, we believe the methods introduced in this paper present a reproducible approach to tackling the opaqueness of AI training corpora and provide organisations with a framework to evaluate and maximise their contributions to AI development.