A Dataset Fusion Algorithm for Generalised Anomaly Detection in Homogeneous Periodic Time Series Datasets

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

The generalisation of Neural Networks (NNs) to multiple datasets is often overlooked in the literature because NNs are typically optimised for specific data sources. Generalisation is especially challenging in time-series-based multi-dataset models, since sequential data from different sensors and collection specifications is difficult to fuse. In a commercial environment, however, generalisation makes effective use of available data and computational power, which is essential in the context of Green AI, the sustainable development of AI models. This paper introduces "Dataset Fusion," a novel dataset composition algorithm that fuses periodic signals from multiple homogeneous datasets into a single dataset while retaining their unique features for generalised anomaly detection. The proposed approach was tested on a case study of 3-phase current data from 2 different homogeneous Induction Motor (IM) fault datasets using an unsupervised LSTMCaps NN. It significantly outperforms conventional training approaches, with an Average F1 score of 0.879, and generalises effectively across all datasets. The approach was also tested with varying percentages of the training data, in line with the principles of Green AI. Results show that using only 6.25% of the training data, which translates to a 93.7% reduction in computational power, results in a mere 4.04% decrease in performance, demonstrating the advantages of the proposed approach in terms of both performance and computational efficiency. Moreover, the algorithm's effectiveness under non-ideal conditions highlights its potential for practical use in real-world applications.
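The figures quoted in the abstract can be related with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: the function names are hypothetical, the "Average F1" is assumed to be an unweighted mean of per-dataset F1 scores, and the computational saving is assumed to scale linearly with the amount of training data. It uses the standard definition of the F1 score in terms of true positives, false positives and false negatives.

```python
from typing import Sequence


def data_reduction_percent(train_fraction_percent: float) -> float:
    """Percentage of the training data left unused when only
    `train_fraction_percent` of it is used for training."""
    if not 0.0 < train_fraction_percent <= 100.0:
        raise ValueError("train_fraction_percent must be in (0, 100]")
    return 100.0 - train_fraction_percent


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 score: harmonic mean of precision and recall,
    written directly in terms of the confusion counts."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        return 0.0  # no positives predicted or present
    return 2 * tp / denominator


def average_f1(per_dataset_f1: Sequence[float]) -> float:
    """Unweighted mean of per-dataset F1 scores (an assumption about
    how an 'Average F1' could be computed, not the paper's definition)."""
    if not per_dataset_f1:
        raise ValueError("need at least one F1 score")
    return sum(per_dataset_f1) / len(per_dataset_f1)


if __name__ == "__main__":
    # 6.25% is one sixteenth of the data, so 93.75% of it is left out;
    # the abstract reports this figure as 93.7%.
    print(data_reduction_percent(6.25))
    # Hypothetical confusion counts, for illustration only.
    print(f1_score(tp=8, fp=2, fn=2))
```

Under the linear-scaling assumption, training on one sixteenth of the data leaves out 93.75% of it, which matches the roughly 93.7% reduction stated above.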

