As the Square Kilometre Array (SKA) nears completion, demand is growing for accurate and reliable automated methods to extract valuable information from the vast amount of data it will collect. Automated source finding is a particularly important task in this context, as it enables the detection and classification of astronomical objects. Deep-learning-based object detection and semantic segmentation models have proven suitable for this purpose. However, training such deep networks requires a large volume of labeled data, which is not trivial to obtain in radio astronomy. Because the data must be manually labeled by experts, the process does not scale to large datasets, which limits the use of deep networks for many tasks. In this work, we propose RADiff, a generative approach based on conditional diffusion models trained on an annotated radio dataset. RADiff generates synthetic images containing radio sources of different morphologies, which can be used to augment existing datasets and reduce the problems caused by class imbalance. We also show that it is possible to generate fully synthetic image-annotation pairs to automatically augment any annotated dataset. We evaluate the effectiveness of this approach by training a semantic segmentation model on a real dataset augmented in two ways: 1) with synthetic images generated from real masks, and 2) with images generated from synthetic semantic masks. Augmentation improves performance, with gains of up to 18% when using real masks and 4% when using synthetic masks. Finally, we employ this model to generate large-scale radio maps with the objective of simulating Data Challenges.