A Latent Space Correlation-Aware Autoencoder for Anomaly Detection in Skewed Data

Unsupervised anomaly detection in latent space has gained importance because discriminating anomalies from normal data becomes difficult in high-dimensional space. Both density estimation and distance-based methods for detecting anomalies in latent space have been explored. These methods show that retaining valuable properties of the input data in latent space leads to better reconstruction of test data. Moreover, real-world sensor data is skewed and non-Gaussian, which makes mean-based estimators unreliable. In addition, anomaly detection methods based on reconstruction error rely on Euclidean distance, which ignores useful correlation information in the feature space, and they fail to reconstruct data accurately when it deviates from the training distribution. In this work, we address the limitations of reconstruction error-based autoencoders and propose a kernelized autoencoder that uses a robust form of Mahalanobis distance (MD) to measure correlation among latent dimensions and thereby detect both near and far anomalies effectively. The resulting hybrid loss is guided by the principle of maximizing the mutual information between the latent space and the high-dimensional prior data space: it maximizes the entropy of the latent space while preserving useful correlation information of the original data in the low-dimensional latent space. The multi-objective function thus has two goals: it measures correlation information in the latent feature space through the robust MD, and it simultaneously preserves useful correlation information from the original data space by maximizing mutual information between the prior and latent spaces.
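
To illustrate why a correlation-aware distance can flag anomalies that Euclidean distance misses, the following is a minimal sketch of a median-centred Mahalanobis score on skewed, correlated data. The abstract does not specify which robust estimator is used, so the coordinate-wise median and the scatter matrix around it are assumptions made here for illustration, as are the function name `robust_mahalanobis`, the `eps` regularizer, and the synthetic stand-in for latent codes.

```python
import numpy as np


def robust_mahalanobis(z_train, z_test, eps=1e-6):
    """Score rows of z_test by Mahalanobis distance to a median-centred fit."""
    # Assumption: the coordinate-wise median replaces the mean as the
    # location estimate, since the mean is pulled by the skewed tail.
    center = np.median(z_train, axis=0)
    centered = z_train - center
    # Scatter matrix around the median; eps * I keeps it invertible.
    cov = centered.T @ centered / (len(z_train) - 1)
    cov += eps * np.eye(z_train.shape[1])
    precision = np.linalg.inv(cov)
    diff = z_test - center
    # Quadratic form diff^T @ precision @ diff, evaluated row by row.
    d2 = np.einsum("ij,jk,ik->i", diff, precision, diff)
    return np.sqrt(d2)


rng = np.random.default_rng(0)
# Hypothetical latent codes: a log-normal (skewed) first dimension and a
# second dimension that is strongly correlated with it.
base = rng.lognormal(mean=0.0, sigma=0.5, size=(500, 1))
z_train = np.hstack([base, 2.0 * base + 0.1 * rng.normal(size=(500, 1))])

z_test = np.array([
    [1.0, 2.0],  # follows the learned correlation
    [1.0, 0.0],  # within the Euclidean spread of the data, but breaks the correlation
])
scores = robust_mahalanobis(z_train, z_test)
print(scores)
```

The second test point lies well within the range of each individual dimension, yet it receives a much larger score because it violates the correlation structure between the two dimensions. A more robust covariance estimator could be substituted for the simple scatter matrix used here.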
