Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a "complexity-control imbalance", in which scaling bottlenecks limit long-term granular steerability. We present SymphonyGen, a 3D hierarchical framework for contemporary cinematic orchestration. SymphonyGen employs a cascading decoder architecture that decomposes the Bar, Track, and Event axes, improving computational efficiency and scalability over conventional 1D or 2D models. We introduce "short-score" conditioning via a beat-quantized multi-voice harmony skeleton, enabling outline control while preserving textural diversity. The model is further refined using Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward, aligning symbolic output with modern acoustic expectations. Additionally, we implement a dissonance-averse sampling algorithm to suppress unintended tonal clashes during inference. Objective evaluations show that both reinforcement learning and dissonance-averse sampling effectively enhance harmonic cleanliness while maintaining melodic expression. Subjective evaluations demonstrate that SymphonyGen outperforms baselines in musicality and preference for orchestral music generation. Demo page: https://symphonygen.github.io/
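As a rough illustration of the group-relative step that gives GRPO its name, the sketch below normalizes a set of rewards against the statistics of the group they were sampled in. This is a minimal sketch of the generic GRPO advantage computation, not SymphonyGen's training code: the function name `group_relative_advantages`, the `eps` constant, the choice of population standard deviation, and the example scores are all assumptions made for illustration, and the abstract does not specify how the audio-perceptual reward is computed.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per candidate sampled from the same
    conditioning input. `eps` (an assumed constant) avoids division by
    zero when every candidate receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, not from the source
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward scores for four candidates generated from one
# short-score condition (values invented for illustration).
scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(scores)
```

Candidates that score above the group mean receive a positive advantage and those below it a negative one, so no separate value network is needed to provide a baseline.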