How can we assess the reliability of a dataset without access to ground truth? We introduce the problem of reliability scoring for datasets collected from potentially strategic sources. The true data are unobserved, but we observe the outcomes of an unknown statistical experiment that depends on them. To benchmark reliability, we define ground-truth-based orderings that capture how far reported data deviate from the truth. We then propose the Gram determinant score, which measures the volume spanned by vectors describing the empirical distribution of the observed data and experiment outcomes. We show that this score preserves several ground-truth-based reliability orderings and, uniquely up to scaling, yields the same reliability ranking of datasets regardless of the experiment, a property we term experiment agnosticism. Experiments on synthetic noise models, CIFAR-10 embeddings, and real employment data show that the Gram determinant score captures data quality across diverse observation processes.
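To make the idea of a volume-based score concrete, the following is a minimal sketch and not the paper's definition. It assumes that reports and experiment outcomes each take finitely many integer-coded values, and that the "vectors describing the empirical distribution" are the rows of the empirical joint distribution matrix of reports and outcomes. The function name `gram_determinant_score` and its parameters are hypothetical. The determinant of the Gram matrix of those rows equals the squared volume of the parallelepiped they span.

```python
import numpy as np


def gram_determinant_score(reports, outcomes, n_report_values, n_outcome_values):
    """Illustrative score (assumed construction, not the paper's exact one).

    reports, outcomes: equal-length sequences of integer category codes.
    """
    reports = np.asarray(reports, dtype=int)
    outcomes = np.asarray(outcomes, dtype=int)

    # Empirical joint distribution: entry [i, j] is the fraction of samples
    # with report value i and outcome value j.
    joint = np.zeros((n_report_values, n_outcome_values))
    np.add.at(joint, (reports, outcomes), 1.0)
    joint /= len(reports)

    # Gram matrix of the row vectors (one row per report value). Its
    # determinant is the squared volume spanned by those rows.
    gram = joint @ joint.T
    return float(np.linalg.det(gram))


# Synthetic illustration (assumed noise model): binary truth, an outcome that
# matches the truth 90% of the time, and reports that are either truthful or
# flipped 30% of the time.
rng = np.random.default_rng(0)
n = 20000
truth = rng.integers(0, 2, size=n)
outcome = np.where(rng.random(n) < 0.9, truth, 1 - truth)
noisy_report = np.where(rng.random(n) < 0.7, truth, 1 - truth)

score_truthful = gram_determinant_score(truth, outcome, 2, 2)
score_noisy = gram_determinant_score(noisy_report, outcome, 2, 2)
print(score_truthful, score_noisy)
```

In this toy setting the truthful reports should receive the larger score, because noise pulls the rows of the joint distribution toward each other and shrinks the spanned volume. Reports that are independent of the outcomes give collinear rows and a score of zero.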