Machine learning (ML) is becoming a critical tool for interrogating large, complex datasets. Labeling, defined as the process of adding meaningful annotations, is a crucial step in supervised ML. However, labeling datasets is time consuming. Here we show that convolutional neural networks (CNNs), trained on crudely labeled astronomical videos, can be leveraged to improve the quality of data labeling and reduce the need for human intervention. We use videos of the solar magnetic field, crudely labeled into two classes, emergence or non-emergence of bipolar magnetic regions (BMRs), based on their first detection on the solar disk. Traditionally, flux emergence labeling is done manually. We instead train CNNs on the crude labels, manually verify and correct the cases where the labels and the CNN disagree, and repeat this process until convergence. We find that deriving a high-quality labeled dataset through this iterative process reduces the necessary manual verification by 50%. Furthermore, by gradually masking the videos and looking for the maximum change in CNN inference, we locate the BMR emergence time without retraining the CNN. This demonstrates the versatility of CNNs for simplifying the challenging task of labeling complex dynamic events.
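The train, verify, and repeat procedure described in the abstract can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: the names `refine_labels`, `train`, `review`, and `toy_train` are hypothetical, the stopping rule (no remaining disagreements among samples not yet reviewed) is an assumed reading of "until convergence", and a one-dimensional threshold classifier stands in for the CNN.

```python
from typing import Callable, List, Sequence, Set, Tuple

Predictor = Callable[[float], int]


def refine_labels(
    samples: Sequence[float],
    crude_labels: Sequence[int],
    train: Callable[[Sequence[float], Sequence[int]], Predictor],
    review: Callable[[int], int],
    max_rounds: int = 10,
) -> Tuple[List[int], Set[int]]:
    """Iteratively correct labels, reviewing only model/label disagreements.

    `train` fits a model on the current labels and returns a predictor.
    `review` stands for the human check: it returns the verified label
    of the sample at the given index.
    """
    labels = list(crude_labels)
    reviewed: Set[int] = set()
    for _ in range(max_rounds):
        predict = train(samples, labels)
        # Only samples that were never reviewed can still be disputed.
        disputed = [
            i for i, s in enumerate(samples)
            if i not in reviewed and predict(s) != labels[i]
        ]
        if not disputed:  # assumed convergence criterion
            break
        for i in disputed:
            labels[i] = review(i)
            reviewed.add(i)
    return labels, reviewed


def toy_train(samples: Sequence[float], labels: Sequence[int]) -> Predictor:
    """Hypothetical stand-in for CNN training: threshold between class means.

    Assumes both classes are present in `labels`.
    """
    zeros = [s for s, y in zip(samples, labels) if y == 0]
    ones = [s for s, y in zip(samples, labels) if y == 1]
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda s: int(s > threshold)
```

In this sketch the number of manual checks equals the size of the returned `reviewed` set, which is the quantity the iterative scheme aims to keep small compared with verifying every sample.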
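The masking idea for locating the emergence time can also be sketched in a few lines. The abstract does not say in which direction the videos are masked or how a masked frame is represented, so the following makes explicit assumptions: frames are revealed one at a time from the start of the video, hidden frames are replaced by a `blank` value, and the emergence frame is the one whose reveal changes the classifier output the most. The names `locate_emergence`, `score`, and `toy_score` are hypothetical, and `toy_score` is a trivial stand-in for a trained CNN.

```python
from typing import Callable, List, Sequence


def locate_emergence(
    frames: Sequence[float],
    score: Callable[[List[float]], float],
    blank: float,
) -> int:
    """Return the index of the frame whose reveal changes `score` the most.

    `score` maps a (partly masked) video to an emergence probability.
    The model is only evaluated, never retrained.
    """
    if not frames:
        raise ValueError("video must contain at least one frame")
    n = len(frames)
    # scores[k] is the model output when only the first k frames are visible.
    scores = []
    for k in range(n + 1):
        masked = list(frames[:k]) + [blank] * (n - k)
        scores.append(score(masked))
    # jumps[k] is the change in output caused by revealing frame k.
    jumps = [abs(scores[k + 1] - scores[k]) for k in range(n)]
    return max(range(n), key=jumps.__getitem__)


def toy_score(video: List[float]) -> float:
    """Hypothetical classifier: 'emergence' if any frame has signal >= 1."""
    return 1.0 if max(video) >= 1.0 else 0.0
```

Here each frame is reduced to a single number for brevity; with real magnetogram videos the frames would be 2D arrays and `score` a forward pass of the trained network, at a cost of one inference per mask position.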