Machine learning (ML) models in production pipelines are frequently retrained on the latest partitions of large, continually growing datasets. Due to engineering bugs, partitions in such datasets almost always have some corrupted features; thus, it is critical to detect data issues and block retraining before downstream ML model accuracy decreases. However, it is difficult to identify when a partition is corrupted enough to block retraining. Blocking too often yields stale model snapshots in production; blocking too rarely yields broken model snapshots in production. In this paper, we present an automatic data validation system for ML pipelines implemented at Meta. We employ what we call a Partition Summarization (PS) approach to data validation: each timestamp-based partition of data is summarized with data quality metrics, and summaries are compared to detect corrupted partitions. We describe how PS can be adapted for several data validation methods and compare their pros and cons. Since none of the methods on its own met our requirements for high precision and high recall in detecting corruptions, we devised GATE, our high-precision, high-recall data validation method. GATE gave a 2.1x average improvement in precision over the baseline on a case study with Instagram's data. Finally, we discuss lessons learned from implementing data validation for Meta's production ML pipelines.
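To make the Partition Summarization idea concrete, the following is a minimal sketch, not the system described in this paper and not GATE. It assumes each partition is a list of dict-like rows, uses two illustrative data quality metrics per feature (completeness and mean), and compares a new partition's summary against historical summaries with a simple z-score rule. The function names (`summarize`, `anomalous_metrics`, `should_block`), the choice of metrics, and the threshold of 3.0 are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def summarize(rows, features):
    """Summarize one partition with per-feature data quality metrics.

    Assumed metrics (illustrative only): the fraction of non-missing
    values and the mean of the non-missing values.
    """
    summary = {}
    for f in features:
        values = [r.get(f) for r in rows]
        present = [v for v in values if v is not None]
        summary[f"{f}.completeness"] = len(present) / len(values) if values else 0.0
        summary[f"{f}.mean"] = mean(present) if present else 0.0
    return summary


def anomalous_metrics(history, current, z_threshold=3.0, min_std=1e-6):
    """Return the metrics in `current` that deviate from `history`.

    `history` is a list of summaries from earlier partitions. A metric is
    flagged when its z-score against the historical values exceeds
    `z_threshold`. `min_std` guards against division by zero when a
    metric has been constant so far.
    """
    flagged = []
    for name, value in current.items():
        past = [s[name] for s in history]
        mu = mean(past)
        sigma = max(pstdev(past), min_std)
        if abs(value - mu) / sigma > z_threshold:
            flagged.append(name)
    return flagged


def should_block(history, current, **kwargs):
    """Block retraining if any summary metric looks anomalous."""
    return bool(anomalous_metrics(history, current, **kwargs))
```

In this sketch, a pipeline would call `summarize` once per timestamp-based partition, keep the resulting summaries, and call `should_block` on each new partition before retraining. A per-metric z-score rule of this kind is only one way to compare summaries; the threshold directly trades off blocking too often against blocking too rarely.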