Since late 2022, generative AI has taken the world by storm, with widespread use of tools including ChatGPT, Gemini, and Claude. Generative AI and large language model (LLM) applications are transforming how individuals find and access data and knowledge. However, the intricate relationship between open data and generative AI, and the vast potential it holds for driving innovation in this field, remain underexplored. This white paper seeks to unpack the relationship between open data and generative AI and explore possible components of a new Fourth Wave of Open Data: Is open data becoming AI ready? Is open data moving towards a data commons approach? Is generative AI making open data more conversational? Will generative AI improve open data quality and provenance? To this end, we provide a new Spectrum of Scenarios framework. This framework outlines a range of scenarios in which open data and generative AI could intersect and what is required from a data quality and provenance perspective to make open data ready for those specific scenarios. These scenarios include: pretraining, adaptation, inference and insight generation, data augmentation, and open-ended exploration. Through this process, we found that for data holders to embrace generative AI to improve open data access and develop greater insights from open data, they must first make progress in five key areas: enhancing transparency and documentation, upholding quality and integrity, promoting interoperability and standards, improving accessibility and usability, and addressing ethical considerations.