Text-to-music generation has advanced rapidly, with modern autoregressive and diffusion-based models producing convincing music from natural-language prompts. However, much of this progress relies on large-scale training data and external pretraining, making it difficult to isolate which design choices remain effective when data and pretraining are controlled. We study this setting using a Diffusion Transformer (DiT) backbone with lyric and timbre conditioning, adapted to an instrumental-only text-to-music task in which the auxiliary lyric and timbre branches receive only degenerate conditioning signals. Through controlled ablations, we find that models retrained without these branches score lower across AudioBox aesthetics, LLM-as-judge, and human mean opinion score (MOS), and that reinvesting the saved parameters as additional DiT depth recovers only a marginal part of the loss. This suggests the auxiliary branches may act as training-time architectural anchors whose contribution goes beyond their explicit conditioning content. We validate the same model through comparisons with external instrumental baselines and through our submission to the ICME 2026 Academic Text-to-Music (ATTM) Grand Challenge. There, our Performance submission ranked first under both the objective metrics and the subsequent organizer-administered MOS over 35 raters, attaining the highest overall MOS across all challenge submissions, while our Efficiency submission was a finalist that tied for second under the objective metrics.