New capabilities in foundation models are owed in large part to massive, widely-sourced, and under-documented training data collections. Existing practices in data collection have led to challenges in documenting data transparency, tracing authenticity, verifying consent, privacy, representation, bias, copyright infringement, and the overall development of ethical and trustworthy foundation models. In response, regulation is emphasizing the need for training data transparency to understand foundation models' limitations. Based on a large-scale analysis of the foundation model training data landscape and existing solutions, we identify the missing infrastructure to facilitate responsible foundation model development practices. We examine the current shortcomings of common tools for tracing data authenticity, consent, and documentation, and outline how policymakers, developers, and data creators can facilitate responsible foundation model development by adopting universal data provenance standards.
翻译:摘要:基础模型的新能力在很大程度上归功于大规模、来源广泛且文档不足的训练数据集合。当前的数据收集实践在记录数据透明度、追溯真实性、验证同意、隐私、代表性、偏见、版权侵权以及整体开发合乎道德且值得信赖的基础模型方面带来了挑战。为此,法规强调需要训练数据透明度以理解基础模型的局限性。基于对基础模型训练数据格局及现有解决方案的大规模分析,我们识别了缺失的基础设施,以促进负责任的基础模型开发实践。我们审视了当前用于追溯数据真实性、同意和文档记录的常见工具的不足,并概述了政策制定者、开发者和数据创建者如何通过采用通用的数据来源标准来促进负责任的基础模型开发。