On the Influence of Data Resampling for Deep Learning-Based Log Anomaly Detection: Insights and Recommendations

Numerous deep learning (DL)-based approaches have attracted considerable attention in the field of software log anomaly detection (LAD). However, a practical challenge persists: class imbalance in the public datasets commonly used to train the DL models. This imbalance takes the form of a substantial disparity between the number of abnormal and normal log sequences; for example, anomalies represent less than 1% of one of the most popular datasets. Previous research has indicated that existing DL-based LAD (DLLAD) approaches may perform unsatisfactorily, particularly on datasets with severe class imbalance. Mitigating class imbalance through data resampling has proven effective for other software engineering tasks, but it has so far been unexplored for LAD. This study aims to fill this gap with an in-depth analysis of the impact of diverse data resampling methods on existing DLLAD approaches from two perspectives. First, we assess the performance of these DLLAD approaches across three datasets and explore the impact of the resampling ratio of normal to abnormal data on ten data resampling methods. Second, we evaluate the effectiveness of the data resampling methods when using the optimal resampling ratios of normal to abnormal data. Our findings indicate that oversampling methods generally outperform undersampling and hybrid methods, and that resampling raw data yields better results than resampling in the feature space. In most cases, certain undersampling and hybrid methods show limited effectiveness. Additionally, based on our exploration of the resampling ratio of normal to abnormal data, we recommend generating more data for minority classes when oversampling and removing less data from majority classes when undersampling. In conclusion, our study provides valuable insights into the intricate relationship between data resampling methods and DLLAD.
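
To make the notion of a resampling ratio of normal to abnormal data concrete, the following is a minimal sketch of random oversampling and random undersampling applied directly to raw log sequences. It is an illustration only: the abstract does not name the ten resampling methods studied, so the two simple random strategies, the function names random_oversample and random_undersample, and the definition of target_ratio as (number of normal sequences) / (number of abnormal sequences) after resampling are all assumptions made here, not the authors' implementation.

```python
import math
import random


def random_oversample(normal, abnormal, target_ratio, seed=0):
    """Duplicate abnormal sequences (sampling with replacement) until
    len(normal) / len(abnormal) is at most target_ratio.

    target_ratio is an assumed definition: normal count / abnormal count.
    """
    if not abnormal:
        raise ValueError("need at least one abnormal sequence to oversample")
    rng = random.Random(seed)
    wanted = math.ceil(len(normal) / target_ratio)  # abnormal count to reach
    extra = max(0, wanted - len(abnormal))          # duplicates to add
    return list(normal), list(abnormal) + rng.choices(abnormal, k=extra)


def random_undersample(normal, abnormal, target_ratio, seed=0):
    """Randomly drop normal sequences (sampling without replacement) until
    len(normal) / len(abnormal) is at most target_ratio."""
    rng = random.Random(seed)
    keep = min(len(normal), int(len(abnormal) * target_ratio))
    return rng.sample(list(normal), keep), list(abnormal)


if __name__ == "__main__":
    # Hypothetical toy data: 990 normal and 10 abnormal sequences (1% anomalies).
    normal = [("normal", i) for i in range(990)]
    abnormal = [("abnormal", i) for i in range(10)]

    n_over, a_over = random_oversample(normal, abnormal, target_ratio=4)
    n_under, a_under = random_undersample(normal, abnormal, target_ratio=4)
    print(len(n_over), len(a_over))    # all normal data kept, minority enlarged
    print(len(n_under), len(a_under))  # minority kept, most normal data removed
```

In this sketch, oversampling keeps every normal sequence and enlarges the minority class, whereas undersampling reaches the same ratio only by discarding most of the normal data. This contrast is what the resampling ratio controls in either direction.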
