This paper presents an overview and the technical framework of the ICME 2026 Grand Challenge on Academic Text-to-Music Generation (ATTM). Despite rapid progress in text-to-music generation (TTM) systems, the field is currently dominated by models trained on massive proprietary datasets with industrial-scale computational resources, which creates a significant barrier for academic research. To address this, the ATTM Challenge establishes a fair-play benchmark that requires participants to train generative models strictly from scratch on a standardized, CC-licensed subset of the MTG-Jamendo dataset containing only instrumental music. The challenge is divided into two tracks: the Efficiency Track (limited to 500M parameters) and the Performance Track (no parameter limit). Submissions are evaluated in a multi-stage process: objective metrics, including Fréchet Audio Distance (FAD), CLAP score, and a novel Concept Coverage Score (CCS), are followed by a subjective listening test. By providing open-source baselines, preprocessing pipelines, reference captions, and public evaluation code for computing FAD and CLAP, the challenge aims to facilitate and promote TTM research in academic contexts.