Can AI Dream of Unseen Galaxies? Conditional Diffusion Model for Galaxy Morphology Augmentation

From arXiv: 29 pages, 17 figures; accepted version for ApJS. Comments welcome. For further reference, see the independent work "Category-based Galaxy Image Generation via Diffusion Models" (Fan, Tang et al.).

Observational astronomy relies on visual feature identification to detect critical astrophysical phenomena. While machine learning (ML) increasingly automates this process, models often generalize poorly in large-scale surveys because labeled datasets, whether from simulations or human annotation, are of limited representativeness. The challenge is especially pronounced for rare yet scientifically valuable objects. To address this, we propose a conditional diffusion model (hereafter GalaxySD) that synthesizes realistic galaxy images for augmenting ML training data. Leveraging the Galaxy Zoo 2 (GZ2) dataset, which contains pairs of visual features and galaxy images from volunteer annotation, we demonstrate that GalaxySD generates diverse, high-fidelity galaxy images that closely adhere to the specified morphological feature conditions. Moreover, the model enables generative extrapolation, projecting well-annotated data into unseen domains and advancing rare object detection. Integrating synthesized images into ML pipelines improves performance in standard morphology classification, boosting completeness and purity by up to 30% across key metrics. For rare object detection, using early-type galaxies with prominent dust lane features (~0.1% of the GZ2 dataset) as a test case, our approach more than doubled the number of detected instances, from 352 to 872, compared to previous studies based on visual inspection. This study highlights the power of generative models to bridge the gap between scarce labeled data and the vast, uncharted parameter space of observational astronomy, and offers insight for the future development of astrophysical foundation models. Our project homepage is available at https://galaxysd-webpage.streamlit.app/.
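The abstract reports its gains in terms of completeness and purity. As a minimal sketch, assuming the conventional astronomical usage (completeness corresponds to recall, purity to precision) rather than any definition confirmed by the paper, the two quantities can be computed for a single class as follows. The function name `completeness_and_purity` and the toy labels are hypothetical and serve only as illustration.

```python
def completeness_and_purity(y_true, y_pred, positive=1):
    """Return (completeness, purity) for the class labeled `positive`.

    Assumed conventional definitions (not taken from the paper):
      completeness = TP / (TP + FN)   # fraction of true objects recovered (recall)
      purity       = TP / (TP + FP)   # fraction of selected objects that are real (precision)
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Guard against empty denominators (no true or no predicted positives).
    completeness = tp / (tp + fn) if (tp + fn) else float("nan")
    purity = tp / (tp + fp) if (tp + fp) else float("nan")
    return completeness, purity


# Toy labels, invented for illustration only: 1 = target class, 0 = everything else.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

c, p = completeness_and_purity(y_true, y_pred)
print(f"completeness={c:.2f}, purity={p:.2f}")  # completeness=0.50, purity=0.67
```

For the rare-object result quoted in the abstract, the increase from 352 to 872 detected instances corresponds to a factor of 872 / 352 ≈ 2.48.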
