A Fourth Wave of Open Data? Exploring the Spectrum of Scenarios for Open Data and Generative AI

Since late 2022, generative AI has taken the world by storm, with widespread use of tools including ChatGPT, Gemini, and Claude. Generative AI and large language model (LLM) applications are transforming how individuals find and access data and knowledge. However, the intricate relationship between open data and generative AI, and the potential it holds for driving innovation in this field, remain underexplored. This white paper seeks to unpack that relationship and to explore possible components of a new Fourth Wave of Open Data: Is open data becoming AI ready? Is open data moving towards a data commons approach? Is generative AI making open data more conversational? Will generative AI improve open data quality and provenance? Towards this end, we provide a new Spectrum of Scenarios framework. This framework outlines a range of scenarios in which open data and generative AI could intersect, and what is required from a data quality and provenance perspective to make open data ready for each scenario. These scenarios are pretraining, adaptation, inference and insight generation, data augmentation, and open-ended exploration. Through this process, we found that for data holders to embrace generative AI to improve open data access and develop greater insights from open data, they must first make progress in five key areas: enhancing transparency and documentation, upholding quality and integrity, promoting interoperability and standards, improving accessibility and usability, and addressing ethical considerations.


