Novelty detection in large scientific datasets faces two key challenges: experimental data are noisy and high-dimensional, and any claim about observed outliers must be statistically robust. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs that support quantifiable claims of scientific discovery. In this work we address these challenges directly, presenting a first step towards a unified pipeline for novelty detection adapted to the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside domain expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations of the observed data from a reference distribution (the null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.
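To make the second stage of the pipeline concrete, the following is a minimal sketch of a two-sample test on low-dimensional embeddings with a small injected anomalous fraction. It is not the NPLM method: the learned NPLM test statistic is replaced here by a much simpler energy-distance statistic calibrated with permutations, and the embeddings are synthetic Gaussian draws rather than outputs of a contrastive encoder. The function names, sample sizes, embedding dimension, and injection fraction are all illustrative assumptions.

```python
import numpy as np


def energy_statistic(dist, idx_ref, idx_obs):
    """Energy distance between two groups, given a precomputed distance matrix."""
    cross = dist[np.ix_(idx_ref, idx_obs)].mean()
    within_ref = dist[np.ix_(idx_ref, idx_ref)].mean()
    within_obs = dist[np.ix_(idx_obs, idx_obs)].mean()
    return 2.0 * cross - within_ref - within_obs


def permutation_two_sample_test(reference, observed, n_perm=200, seed=0):
    """Return (statistic, p-value) for H0: both samples share one distribution.

    Hypothetical helper: a stand-in for a learned test statistic such as NPLM.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([reference, observed])
    n_ref = len(reference)

    # Pairwise Euclidean distances, computed once and reused for every permutation.
    diff = pooled[:, None, :] - pooled[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    idx = np.arange(len(pooled))
    t_obs = energy_statistic(dist, idx[:n_ref], idx[n_ref:])

    # Null distribution: reshuffle the reference/observed labels.
    t_null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(idx)
        t_null[i] = energy_statistic(dist, perm[:n_ref], perm[n_ref:])

    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + np.sum(t_null >= t_obs)) / (n_perm + 1)
    return t_obs, p_value


# Synthetic "embeddings" (assumption: 4 dimensions, Gaussian reference).
rng = np.random.default_rng(42)
dim = 4
reference = rng.normal(size=(500, dim))

# Observed sample with a 10% injection of shifted (anomalous) points.
background = rng.normal(size=(180, dim))
anomalies = rng.normal(loc=3.0, size=(20, dim))
observed = np.vstack([background, anomalies])
t_signal, p_signal = permutation_two_sample_test(reference, observed)

# Observed sample drawn purely from the reference distribution.
null_sample = rng.normal(size=(200, dim))
t_null_only, p_null_only = permutation_two_sample_test(reference, null_sample)

print(f"with injection:    t = {t_signal:.4f}, p = {p_signal:.4f}")
print(f"without injection: t = {t_null_only:.4f}, p = {p_null_only:.4f}")
```

The permutation calibration is what turns the raw statistic into a p-value against the null hypothesis. The smallest p-value it can report is 1/(n_perm + 1), so claiming a high significance this way would require far more permutations than the 200 used in this sketch.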