AnomalyMatch: Discovering Rare Objects of Interest with Semi-supervised and Active Learning

Anomaly detection in large datasets is essential in astronomy and computer vision. However, because labelled data are scarce, supervised methods are often infeasible for anomaly detection. We present AnomalyMatch, an anomaly detection framework that combines the semi-supervised FixMatch algorithm, using EfficientNet classifiers, with active learning. AnomalyMatch is tailored for large-scale applications and integrated into the ESA Datalabs science platform. In this method, we treat anomaly detection as a binary classification problem and efficiently utilise limited labelled and abundant unlabelled images for training. We enable active learning via a user interface for verification of high-confidence anomalies and correction of false positives. Evaluations on the GalaxyMNIST astronomical dataset and the miniImageNet natural-image benchmark under severe class imbalance show strong performance. Starting from five to ten labelled anomalies, we achieve an average AUROC of 0.96 (miniImageNet) and 0.89 (GalaxyMNIST), with respective AUPRC of 0.82 and 0.77. After three active learning cycles, anomalies are ranked with 76% (miniImageNet) to 94% (GalaxyMNIST) precision in the top 1% of images ranked by score. We compare against the established Astronomaly software on selected 'odd' galaxies from the 'Galaxy Zoo - The Galaxy Challenge' dataset, achieving comparable performance with an average AUROC of 0.83. Our results underscore the exceptional utility and scalability of this approach for anomaly discovery, highlighting the value of specialised approaches for domains characterised by severe label scarcity.
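To make the reported metrics concrete, the following is a minimal sketch of how AUROC and precision in the top fraction of ranked images could be computed from anomaly scores and binary labels. It is not taken from AnomalyMatch itself: the function names `auroc` and `precision_in_top_fraction`, the pairwise AUROC formulation, and the rounding of the cutoff are assumptions made for illustration only.

```python
import math


def auroc(scores, labels):
    """AUROC as the probability that a random anomaly (label 1) scores
    higher than a random normal image (label 0); ties count as 0.5.
    Pairwise form chosen for clarity, not speed (assumption for this sketch)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomaly and one normal example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_in_top_fraction(scores, labels, fraction=0.01):
    """Fraction of true anomalies among the highest-scoring images.
    The cutoff k is rounded up and is at least 1 (assumption)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, math.ceil(fraction * len(scores)))
    # Sort by score, highest first, and keep the top k entries.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(y for _, y in ranked[:k]) / k
```

With `fraction=0.01`, the second function corresponds to the "precision in the top 1%" figure quoted in the abstract, under the assumption that images are ranked by descending anomaly score.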
