Data verification, the process of labeling data items as correct or incorrect, is a preprocessing step that can critically affect the quality of results in data-driven pipelines. Despite recent advances, verification can still produce erroneous labels that propagate to downstream query results in complex ways. We present a framework that complements existing verification tools by assessing the impact of potential labeling errors on query outputs and by guiding additional verification steps that improve result reliability. To this end, we introduce the Maximal Error Score (MES), a worst-case uncertainty metric that quantifies the reliability of query output tuples independently of the underlying data distribution. As an auxiliary indicator, we identify risky tuples: input tuples for which reducing label uncertainty may, counterintuitively, increase output uncertainty. We then develop efficient algorithms for computing MES and detecting risky tuples, as well as a generic algorithm, named MESReduce, that builds on both indicators and interacts with external verifiers to select effective additional verification steps. We implement our techniques in a prototype system and evaluate them on real and synthetic datasets, demonstrating that MESReduce substantially reduces MES and improves the accuracy of verification results.