Identifying and appropriately handling inconsistencies in data at deployment time is crucial for the reliable use of machine learning models. While recent data-centric methods are able to identify such inconsistencies with respect to the training set, they suffer from two key limitations: (1) suboptimality in settings where features exhibit statistical independencies, due to their use of compressive representations, and (2) a lack of localization to pinpoint why a sample might be flagged as inconsistent, which is important for guiding future data collection. We address these two limitations by using directed acyclic graphs (DAGs) to encode, as a structure, the probability distribution of the training set's features and the independencies among them. Our method, called DAGnosis, leverages these structural interactions to draw data-centric conclusions about individual samples. DAGnosis makes it possible to localize the causes of inconsistencies on a DAG, an aspect overlooked by previous approaches. Moreover, we show empirically that leveraging these interactions (1) leads to more accurate detection of inconsistencies and (2) provides more detailed insights into why some samples are flagged.
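To make the idea of localizing an inconsistency on a DAG concrete, the following is a minimal sketch and not the DAGnosis implementation. It assumes the DAG is already known and given as a parent list per feature, models each feature as a linear function of its parents, and calibrates a per-feature residual threshold with a split-conformal quantile. These modelling choices, and the names `fit_node_models` and `flag_inconsistent_features`, are illustrative assumptions that do not come from the abstract.

```python
import numpy as np


def fit_node_models(X, parents, alpha=0.1):
    """Fit one linear model per feature on its DAG parents (assumed known).

    Returns, for each feature j, its parents, the fitted coefficients, and a
    split-conformal threshold on the absolute residual.
    """
    half = X.shape[0] // 2
    train, calib = X[:half], X[half:]
    models = {}
    for j, pa in parents.items():
        # Design matrix: intercept column plus one column per parent.
        A_train = np.column_stack([np.ones(len(train))] + [train[:, p] for p in pa])
        coef, *_ = np.linalg.lstsq(A_train, train[:, j], rcond=None)
        A_calib = np.column_stack([np.ones(len(calib))] + [calib[:, p] for p in pa])
        resid = np.abs(calib[:, j] - A_calib @ coef)
        # Split-conformal quantile of the calibration residuals.
        k = int(np.ceil((len(calib) + 1) * (1 - alpha)))
        q = np.sort(resid)[min(k, len(calib)) - 1]
        models[j] = (pa, coef, q)
    return models


def flag_inconsistent_features(x, models):
    """Return the features whose value is implausible given their parents."""
    flagged = []
    for j, (pa, coef, q) in models.items():
        pred = coef[0] + sum(c * x[p] for c, p in zip(coef[1:], pa))
        if abs(x[j] - pred) > q:
            flagged.append(j)
    return flagged


# Synthetic example (hypothetical data): feature 0 -> feature 1, feature 2 isolated.
rng = np.random.default_rng(0)
x0 = rng.normal(size=2000)
x1 = 2.0 * x0 + 0.1 * rng.normal(size=2000)
x2 = rng.normal(size=2000)
X = np.column_stack([x0, x1, x2])
parents = {0: [], 1: [0], 2: []}

models = fit_node_models(X, parents)
# Feature 1 is far from what its parent (feature 0) implies.
print(flag_inconsistent_features(np.array([0.5, 5.0, 0.0]), models))
```

Because each feature is checked only against its own parents, a flagged sample comes with the specific nodes that disagree with the training distribution, which is the kind of localization the abstract describes. In this toy setup only the child feature is expected to be flagged, since the root features of the test sample are unremarkable.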