High-quality benchmarks are essential for tracking scientific progress fairly and accurately and for enabling practitioners to make informed methodological choices. Outlier detection (OD) on tabular data underpins numerous real-world applications, yet existing OD benchmarks remain limited. AdBench, the de facto standard OD benchmark in the literature, comprises only 57 datasets. In addition to other shortcomings discussed in this work, its small scale severely restricts diversity and statistical power. We introduce MacrOData, a large-scale benchmark suite for tabular OD comprising three carefully curated components: OddBench, with 790 datasets containing real-world semantic anomalies; OvrBench, with 856 datasets featuring real-world statistical outliers; and SynBench, with 800 synthetically generated datasets spanning diverse data priors and outlier archetypes. Owing to its scale and diversity, MacrOData enables comprehensive and statistically robust evaluation of tabular OD methods. Our benchmarks further satisfy several key desiderata: we provide standardized train/test splits for all datasets; we offer public/private benchmark partitions, with test labels for the private partition held out for an online leaderboard; and we annotate our datasets with semantic metadata. We conduct extensive experiments across all benchmarks, evaluating a broad range of OD methods, including classical, deep, and foundation models, over diverse hyperparameter configurations. We report detailed empirical findings, practical guidelines, and individual performance results as references for future research. All three benchmarks, totaling 2,446 datasets, are open-sourced, along with a publicly accessible leaderboard hosted at https://huggingface.co/MacrOData-CMU.