Anomaly detection in large datasets is essential in astronomy and computer vision. However, labelled data are scarce, so supervised methods are often infeasible for this task. We present AnomalyMatch, an anomaly detection framework that combines the semi-supervised FixMatch algorithm, built on EfficientNet classifiers, with active learning. AnomalyMatch is tailored for large-scale applications and integrated into the ESA Datalabs science platform. We treat anomaly detection as a binary classification problem and train efficiently on a small set of labelled images together with abundant unlabelled images. Active learning is enabled through a user interface for verifying high-confidence anomalies and correcting false positives. Evaluations on the GalaxyMNIST astronomical dataset and the miniImageNet natural-image benchmark under severe class imbalance show strong performance. Starting from five to ten labelled anomalies, we achieve an average AUROC of 0.96 (miniImageNet) and 0.89 (GalaxyMNIST), with respective AUPRC of 0.82 and 0.77. After three active learning cycles, anomalies are ranked with 76% (miniImageNet) to 94% (GalaxyMNIST) precision in the top 1% of images by score. We compare against the established Astronomaly software on selected 'odd' galaxies from the 'Galaxy Zoo - The Galaxy Challenge' dataset, achieving comparable performance with an average AUROC of 0.83. Our results underscore the utility and scalability of this approach for anomaly discovery and highlight the value of specialised approaches for domains characterised by severe label scarcity.
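To illustrate how a FixMatch-style objective can use labelled and unlabelled images together in the binary setting described above, the following is a minimal NumPy sketch. It is not the AnomalyMatch implementation: the function names, the confidence threshold of 0.95, and the weight `lambda_u` are assumptions chosen for illustration, and the logits stand in for the outputs of a classifier such as an EfficientNet on weakly and strongly augmented views of the same images.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    """Per-sample cross-entropy for integer class labels."""
    p = softmax(logits)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12)

def fixmatch_loss(logits_lab, labels, logits_weak, logits_strong,
                  threshold=0.95, lambda_u=1.0):
    """Hypothetical FixMatch-style loss (illustrative names and defaults).

    logits_lab    : logits for labelled images, shape (n_lab, 2)
    labels        : integer labels, 0 = nominal, 1 = anomaly
    logits_weak   : logits for weakly augmented unlabelled images
    logits_strong : logits for strongly augmented views of the same images
    """
    # Supervised term on the small labelled set.
    supervised = cross_entropy(logits_lab, labels).mean()

    # Pseudo-labels come from the weak view, kept only when confident.
    probs_weak = softmax(logits_weak)
    pseudo_labels = probs_weak.argmax(axis=1)
    mask = probs_weak.max(axis=1) >= threshold

    # Consistency term: the strong view should match the pseudo-label.
    # Averaging over all unlabelled samples, not only the masked ones.
    unsupervised = (cross_entropy(logits_strong, pseudo_labels) * mask).mean()

    return supervised + lambda_u * unsupervised, mask
```

In this sketch, unlabelled images whose weak-view prediction falls below the threshold contribute nothing to the loss, so the unlabelled pool only influences training once the classifier becomes confident about individual images.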
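The ranking result quoted above (precision within the top 1% of images by score) can be computed along the following lines. This is a sketch under stated assumptions: the function name and arguments are hypothetical, `scores` is taken to be the classifier's anomaly score per image, and `is_anomaly` the ground-truth flags used for evaluation.

```python
import numpy as np

def precision_at_top_fraction(scores, is_anomaly, fraction=0.01):
    """Fraction of true anomalies among the highest-scoring images.

    scores     : anomaly score per image (higher means more anomalous)
    is_anomaly : boolean ground-truth flag per image
    fraction   : share of the dataset to inspect, e.g. 0.01 for the top 1%
    """
    scores = np.asarray(scores, dtype=float)
    is_anomaly = np.asarray(is_anomaly, dtype=bool)

    # Inspect at least one image, rounding the cut-off up.
    k = max(1, int(np.ceil(fraction * len(scores))))

    # Indices of the k highest scores, in descending order.
    top = np.argsort(-scores, kind="stable")[:k]
    return is_anomaly[top].mean()
```

Under severe class imbalance this metric is a useful complement to AUROC, since it reflects what a user reviewing only the highest-ranked candidates would actually see.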