In the era of advanced artificial intelligence, exemplified by large-scale generative models such as GPT-4, ensuring the traceability, verifiability, and reproducibility of datasets throughout their lifecycle is paramount for research institutions and technology companies. These organisations increasingly rely on vast corpora to train and fine-tune AI models, resulting in intricate data supply chains that demand effective data governance mechanisms. The challenge intensifies because diverse stakeholders may use different tools, often without adequate measures to ensure the accountability of data and the reliability of outcomes. In this study, we adapt the concept of the ``Software Bill of Materials'' to data governance and management to address these challenges, and introduce the ``Data Bill of Materials'' (DataBOM), which captures the dependency relationships between datasets and stakeholders by storing specific metadata. We present a platform architecture for providing blockchain-based DataBOM services, describe the interaction protocol for stakeholders, and discuss the minimal requirements for DataBOM metadata. The proposed solution is evaluated for feasibility through a case study and for performance through quantitative analysis.