An Active Learning Framework with a Class Balancing Strategy for Time Series Classification

Source: arXiv. Master's thesis accepted by Memorial University of Newfoundland. Chapter 3 was published in the Journal of Frontiers in Robotics and AI. Chapter 4 was published at the IEEE Systems Conference 2024.

Training machine learning models for classification tasks often requires labeling large numbers of samples, which is costly and time-consuming, especially in time series analysis. This research investigates Active Learning (AL) strategies to reduce the amount of labeled data needed for effective time series classification. Traditional AL techniques cannot control how many instances are selected for labeling from each class, which can bias both instance selection and classification performance, particularly on imbalanced time series datasets. To address this, we propose a novel class-balancing instance selection algorithm that is integrated with standard AL strategies. Our approach selects more instances from classes with fewer labeled examples, thereby mitigating imbalance in time series datasets. We demonstrate the effectiveness of our AL framework in selecting informative data samples in two distinct domains: tactile texture recognition and industrial fault detection. In robotics, our method achieves high-performance texture categorization while reducing the amount of labeled training data required to 70%. We also evaluate how different sliding window time intervals affect robotic texture classification under AL strategies. In synthetic fiber manufacturing, we adapt AL techniques to fault classification, aiming to minimize the cost and time of data annotation for industry. We also address the real-life class imbalance in the multiclass industrial anomaly dataset using our class-balancing instance selection algorithm integrated with AL strategies. Overall, this thesis highlights the potential of our AL framework across these two distinct domains.
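The abstract does not spell out the class-balancing algorithm, so the following is only a minimal sketch of the general idea, not the thesis's method. It assumes entropy-based uncertainty sampling as the standard AL strategy, an inverse-count weighting for the per-class query budget, and the model's predicted class as a stand-in for the unknown label of each pool instance. The function names `entropy`, `class_quotas`, and `select_batch` are hypothetical.

```python
import math
from collections import Counter


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def class_quotas(labeled_counts, batch_size):
    """Split the query budget so classes with fewer labels get more queries.

    Assumption: weights are proportional to 1 / (labeled count + 1).
    """
    weights = {c: 1.0 / (n + 1) for c, n in labeled_counts.items()}
    total = sum(weights.values())
    quotas = {c: int(batch_size * w / total) for c, w in weights.items()}
    # Hand any rounding remainder to the least-labeled classes first.
    leftover = max(0, batch_size - sum(quotas.values()))
    for c in sorted(labeled_counts, key=labeled_counts.get)[:leftover]:
        quotas[c] += 1
    return quotas


def select_batch(pool_probs, labeled_counts, batch_size):
    """Return pool indices to query: most uncertain first, capped per class.

    pool_probs[i][k] is the predicted probability that pool instance i
    belongs to the k-th class in sorted(labeled_counts).
    """
    quotas = class_quotas(labeled_counts, batch_size)
    classes = sorted(labeled_counts)
    # Rank the unlabeled pool by uncertainty (highest entropy first).
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]), reverse=True)
    chosen, taken = [], Counter()
    for i in ranked:
        probs = pool_probs[i]
        # Assumption: the predicted class approximates the unknown true label.
        pred = classes[probs.index(max(probs))]
        if taken[pred] < quotas[pred]:
            chosen.append(i)
            taken[pred] += 1
    # If a quota could not be filled, top up with the next most uncertain.
    for i in ranked:
        if len(chosen) >= batch_size:
            break
        if i not in chosen:
            chosen.append(i)
    return chosen


if __name__ == "__main__":
    # Class 0 already has 7 labels, class 1 has only 1.
    pool = [[0.52, 0.48], [0.45, 0.55], [0.9, 0.1],
            [0.3, 0.7], [0.2, 0.8], [0.6, 0.4]]
    print(select_batch(pool, {0: 7, 1: 1}, batch_size=3))
```

In this toy pool, plain uncertainty sampling would query the three instances closest to the decision boundary regardless of class. The quota step instead steers the whole batch toward instances predicted to belong to the under-labeled class 1. Relying on predicted classes is a design shortcut: a weak early model can mispredict minority instances, so a practical variant might fall back to plain uncertainty sampling for the first few rounds.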
