At the intersection of machine learning and data analysis, anomaly detection aims to identify observations that exhibit abnormal behaviour. Whether the anomalies are measurement errors, disease development, severe weather, defective production items, failed equipment, financial fraud or crisis events, their timely identification and isolation is an important task in almost every area of industry and science. While a substantial body of literature is devoted to the detection of anomalies, little attention is paid to their explanation. This is mostly due to the intrinsically unsupervised nature of the task and the non-robustness of exploratory methods such as principal component analysis (PCA). We introduce a new statistical tool dedicated to the exploratory analysis of abnormal observations, using data depth as a score. Anomaly component analysis (ACA for short) is a method that searches for a low-dimensional data representation that best visualises and explains anomalies. This low-dimensional representation not only allows groups of anomalies to be distinguished better than state-of-the-art methods do, but also provides an explanation for anomalies that is linear in the variables and thus easily interpretable. In a comparative study on simulated and real data, ACA also proves advantageous for anomaly analysis relative to methods in the literature.