The generalisation of Neural Networks (NNs) to multiple datasets is often overlooked in the literature because NNs are typically optimised for a specific data source. Generalisation is especially challenging for time-series-based multi-dataset models, since sequential data from different sensors and collection specifications is difficult to fuse. In a commercial environment, however, generalisation makes effective use of available data and computational power, which is essential for Green AI, the sustainable development of AI models. This paper introduces "Dataset Fusion," a novel dataset composition algorithm that fuses periodic signals from multiple homogeneous datasets into a single dataset while retaining their unique features for generalised anomaly detection. The proposed approach was tested on a case study of three-phase current data from two different homogeneous Induction Motor (IM) fault datasets using an unsupervised LSTMCaps NN. It significantly outperforms conventional training approaches, with an average F1 score of 0.879, and generalises effectively across all datasets. In line with the principles of Green AI, the approach was also tested with varying percentages of the training data. Using only 6.25\% of the training data, which translates to a 93.7\% reduction in computational power, results in a mere 4.04\% decrease in performance, demonstrating the advantages of the approach in both performance and computational efficiency. Moreover, the algorithm's effectiveness under non-ideal conditions highlights its potential for practical use in real-world applications.