Anomaly detection is a branch of data analysis and machine learning that aims to identify observations exhibiting abnormal behaviour. Whether the anomalies are measurement errors, disease development, severe weather, defective production items, failed equipment, financial fraud or crisis events, their timely identification, isolation and explanation is an important task in almost every branch of science and industry. By providing a robust ordering, data depth (a statistical function that measures how strongly any point of the space belongs to a data set) becomes a particularly useful tool for detecting anomalies. Already known for its theoretical properties, data depth has undergone substantial computational development over the last decade, and particularly in recent years, which has made it applicable to contemporary-sized problems of data analysis and machine learning. In this article, data depth is studied as an efficient anomaly detection tool in a multivariate setting, assigning abnormality labels to observations with lower depth values. The practical questions discussed are whether invariances of the depth function are necessary and reasonable, the shape of the depth function, its robustness and computational complexity, and the choice of the threshold. Illustrations include use cases that highlight the advantageous behaviour of data depth in various settings.
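To make the labelling rule concrete, the following is a minimal sketch of depth-based anomaly detection. It assumes the Mahalanobis depth, $D(x) = 1 / (1 + (x - \mu)^\top \Sigma^{-1} (x - \mu))$, as one simple depth notion; the abstract does not say which depth functions the article uses. The function names, the synthetic data and the `contamination` quantile used as a threshold are all illustrative assumptions, not the article's procedure.

```python
import numpy as np

def mahalanobis_depth(points, data):
    """Mahalanobis depth of each row of `points` with respect to `data`.

    Assumption: mean and covariance are the plain (non-robust) sample
    estimates; a robust variant would swap in robust estimates.
    """
    mu = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    diff = points - mu
    # Squared Mahalanobis distance, computed with a linear solve
    # instead of an explicit matrix inverse.
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return 1.0 / (1.0 + d2)

def flag_anomalies(data, contamination=0.05):
    """Label as anomalous the observations with the lowest depth values.

    `contamination` is a hypothetical tuning parameter: the fraction of
    the sample whose depth falls at or below the threshold.
    """
    depth = mahalanobis_depth(data, data)
    threshold = np.quantile(depth, contamination)
    return depth <= threshold, depth, threshold

# Synthetic illustration: a Gaussian cloud plus three distant points.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
data = np.vstack([inliers, outliers])

is_anomaly, depth, threshold = flag_anomalies(data, contamination=0.05)
```

Because the sample mean and covariance are themselves affected by outliers, this particular depth is not robust; it only illustrates the "low depth means abnormal" rule and the role of the threshold.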