Outliers and anomalies strongly affect model estimation and data processing, as shown by decades of research across many fields: thousands of research papers have been published on the subject. Numerous reviews, surveys, and textbooks have therefore sought to summarize this literature, covering a wide range of methods from both the statistical and data mining communities. These efforts are invaluable, but they face an inherent difficulty: outliers and anomalies occur in every data-intensive application, regardless of application field or scientific discipline. The resulting collection of papers is consequently voluminous and somewhat heterogeneous. To address the need for knowledge organization in this domain, this paper presents the first systematic meta-survey of general surveys and reviews on outlier and anomaly detection. Following a classical systematic survey approach, the study collects nearly 500 papers using two specialized scientific search engines. From this collection, a subset of 56 papers that claim to be general surveys on outlier detection is selected, with a snowball search used to enhance field coverage. A quality assessment phase then narrows the selection to 25 high-quality general surveys. Using this curated collection, the paper investigates the evolution of the outlier detection field over a 20-year period, revealing emerging themes and methods. An analysis of the surveys also sheds light on the survey-writing practices adopted by scholars from the different communities that have contributed to this field. Finally, the paper examines several topics on which consensus has emerged in the literature: taxonomies of outlier types, challenges posed by high-dimensional data, the importance of anomaly scores, the impact of learning conditions, difficulties in benchmarking, and the significance of neural networks. Non-consensual aspects are also discussed, particularly the distinction between local and global outliers and the challenges of organizing detection methods into meaningful taxonomies.