Between Randomness and Arbitrariness: Some Lessons for Reliable Machine Learning at Scale

To develop rigorous knowledge about ML models, and the systems in which they are embedded, we need reliable measurements. But reliable measurement is fundamentally challenging, and touches on issues of reproducibility, scalability, uncertainty quantification, epistemology, and more. This dissertation addresses the criteria needed to take reliability seriously: criteria for designing meaningful metrics, and criteria for methodologies that ensure we can dependably and efficiently measure these metrics at scale and in practice. In doing so, it articulates a research vision for a new field of scholarship at the intersection of machine learning, law, and policy. Within this frame, we cover topics under three themes: (1) quantifying and mitigating sources of arbitrariness in ML; (2) taming randomness in uncertainty estimation and optimization algorithms, in order to achieve scalability without sacrificing reliability; and (3) providing methods for evaluating generative-AI systems, with a specific focus on quantifying memorization in language models and on training latent diffusion models on open-licensed data. Through its contributions to these three themes, this dissertation serves as an empirical proof by example that research on reliable measurement for machine learning is intimately and inescapably bound up with research in law and policy. These disciplines pose similar research questions about reliable measurement in machine learning. They are, in fact, two complementary sides of the same research vision, which, broadly construed, aims to construct machine-learning systems that cohere with broader societal values.
