In this paper, we first situate the challenges of measuring data quality under Project Lighthouse in the broader academic context. We then discuss in detail the three core data quality metrics we use for measurement, two of which extend prior academic work. Using those data quality metrics as examples, we propose a framework, based on machine learning classification, for empirically justifying the choice of data quality metrics and their associated minimum thresholds. Finally, we outline how these methods enable us to rigorously meet the principle of data minimization when analyzing potential experience gaps under Project Lighthouse, an approach we term quantitative data minimization.