Towards a more realistic evaluation of machine learning models for bearing fault diagnosis

Reliable detection of bearing faults is essential for maintaining the safety and operational efficiency of rotating machinery. Although recent machine learning (ML) approaches, particularly deep learning, have shown strong performance in controlled settings, many reported results do not carry over to real-world applications because of methodological flaws, most notably data leakage. This paper investigates data leakage in vibration-based bearing fault diagnosis and its impact on model evaluation. We demonstrate that common dataset partitioning strategies, such as segment-wise and condition-wise splits, introduce spurious correlations that inflate performance metrics. To address this, we propose a rigorous, leakage-free evaluation methodology centered on bearing-wise data partitioning, which ensures that no physical bearing contributes data to both the training and test sets. Additionally, we reformulate the classification task as a multi-label problem, which enables the detection of co-occurring fault types and the use of prevalence-independent metrics such as Macro AUROC. Beyond preventing leakage, we examine the effect of dataset diversity on generalization and show that the number of unique training bearings is a decisive factor in achieving robust performance. We evaluate our methodology on three widely adopted datasets: Case Western Reserve University (CWRU), Paderborn University (PU), and University of Ottawa (UORED-VAFCLS). This study highlights the importance of leakage-aware evaluation protocols and provides practical guidelines for dataset partitioning, model selection, and validation, with the aim of fostering more trustworthy ML systems for industrial fault diagnosis.
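
As a rough illustration of the two ideas named above, the following minimal Python sketch partitions signal segments by bearing identity and computes a macro-averaged AUROC over multi-label targets. It is not the authors' implementation: the function names (`bearing_wise_split`, `auroc`, `macro_auroc`), the `test_fraction` and `seed` parameters, and the assumption that every segment carries a bearing identifier are all hypothetical choices made for this example.

```python
import random


def bearing_wise_split(bearing_ids, test_fraction=0.3, seed=0):
    """Split segment indices so that no bearing appears in both sets.

    bearing_ids[i] is the (assumed) identifier of the physical bearing
    that segment i was recorded from.
    """
    unique = sorted(set(bearing_ids))
    if len(unique) < 2:
        raise ValueError("need at least two distinct bearings to split")
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out whole bearings, keeping at least one bearing on each side.
    n_test = min(len(unique) - 1, max(1, round(len(unique) * test_fraction)))
    test_bearings = set(unique[:n_test])
    train_idx = [i for i, b in enumerate(bearing_ids) if b not in test_bearings]
    test_idx = [i for i, b in enumerate(bearing_ids) if b in test_bearings]
    return train_idx, test_idx


def auroc(y_true, y_score):
    """Binary AUROC as the fraction of (positive, negative) pairs that are
    ranked correctly; tied scores count as half a correct pair."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined without both classes present")
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


def macro_auroc(Y_true, Y_score):
    """Unweighted mean of per-label AUROC for multi-label targets.

    Y_true and Y_score are lists of rows, one row per sample and one
    column per fault label.
    """
    n_labels = len(Y_true[0])
    per_label = [
        auroc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels
```

Because whole bearings are held out, segments cut from the same recording can never land on both sides of the split, which is the failure mode attributed to segment-wise partitioning. The macro average weights every fault label equally, regardless of how often that fault occurs in the test set.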
