WeShap: Weak Supervision Source Evaluation with Shapley Values

Data annotation is a significant bottleneck in training contemporary machine learning models. The Programmatic Weak Supervision (PWS) pipeline addresses this bottleneck by using multiple weak supervision sources to label data automatically, which speeds up annotation. Because these sources contribute unequally to the accuracy of PWS, a robust and efficient metric for evaluating them is needed, both to understand the behavior and performance of the PWS pipeline and to guide corrective measures. We introduce WeShap values, an evaluation metric grounded in Shapley values that quantifies the average contribution of each weak supervision source within a proxy PWS pipeline. We show that WeShap values can be computed efficiently with dynamic programming, with computational complexity quadratic in the number of weak supervision sources. Our experiments demonstrate the versatility of WeShap values across applications, including identifying beneficial or detrimental labeling functions, refining the PWS pipeline, and correcting mislabeled data. WeShap values also help explain the behavior of the PWS pipeline and examine specific instances of mislabeled data. Although WeShap values are derived from a specific proxy PWS pipeline, we show empirically that they generalize to other PWS pipeline configurations. Revising the PWS pipeline with WeShap values improves downstream model accuracy by an average of 4.8 points over previous state-of-the-art methods, underscoring their effectiveness in improving the quality of training data for machine learning models.
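To make the idea of a Shapley-style contribution score for weak supervision sources concrete, the following is a minimal illustrative sketch, not the WeShap algorithm itself. It assumes binary labels, an unweighted majority vote as a stand-in label model, and accuracy on a small labeled set as the utility; the names `ABSTAIN`, `majority_vote_accuracy`, and `shapley_values` are hypothetical. It enumerates every coalition of sources, so its cost is exponential in the number of sources, unlike the quadratic dynamic-programming computation described above.

```python
from itertools import combinations
from math import comb

ABSTAIN = -1  # hypothetical marker meaning "this source did not vote"


def majority_vote_accuracy(votes, labels, sources):
    """Utility of a coalition: accuracy of an unweighted majority vote.

    votes[i][j] is the vote (0, 1, or ABSTAIN) of source j on example i.
    Ties and examples with no votes score 0.5 (a random binary guess).
    """
    score = 0.0
    for row, y in zip(votes, labels):
        cast = [row[j] for j in sources if row[j] != ABSTAIN]
        ones = sum(cast)
        zeros = len(cast) - ones
        if ones == zeros:
            score += 0.5
        else:
            score += float((1 if ones > zeros else 0) == y)
    return score / len(labels)


def shapley_values(votes, labels):
    """Exact Shapley value of each source by enumerating all coalitions."""
    n = len(votes[0])
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size
            weight = 1.0 / (n * comb(n - 1, size))
            for subset in combinations(others, size):
                gain = (majority_vote_accuracy(votes, labels, subset + (i,))
                        - majority_vote_accuracy(votes, labels, subset))
                values[i] += weight * gain
    return values


# Toy data (made up): source 0 is always right, source 1 is always wrong,
# source 2 always abstains.
votes = [[1, 0, ABSTAIN], [0, 1, ABSTAIN], [1, 0, ABSTAIN], [0, 1, ABSTAIN]]
labels = [1, 0, 1, 0]
print(shapley_values(votes, labels))
```

On this toy data the three scores come out at approximately 0.5, -0.5, and 0, so the sign of a score separates the helpful source from the harmful one, and the abstaining source gets no credit. The scores also sum to the utility of the full set of sources minus the utility of the empty set, which is the efficiency property of Shapley values.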
