Melodic harmonization, the task of generating harmonic accompaniments for a given melody, remains a central challenge in computational music generation. Recent single-encoder transformer approaches have framed harmonization as a masked sequence modeling problem, but existing training curricula inspired by discrete diffusion often result in weak (cross-)attention between melody and harmony. This leads to limited exploitation of melodic cues, particularly in out-of-domain contexts. In this work, we introduce a training curriculum, FF (full-to-full), which keeps all harmony tokens masked for several training steps before progressively unmasking entire sequences, thereby strengthening melody-harmony interactions. We systematically evaluate this approach against prior curricula across multiple experimental axes, including temporal quantization (quarter vs. sixteenth note), bar-level vs. time-signature conditioning, melody representation (full range vs. pitch class), and inference-time unmasking strategies. Models are trained on the HookTheory dataset and evaluated both in-domain and on a curated collection of jazz standards, using a comprehensive set of metrics that assess chord progression structure, harmony-melody alignment, and rhythmic coherence. Results demonstrate that the proposed FF curriculum consistently outperforms baselines on nearly all metrics, with particularly strong gains in out-of-domain evaluations, where harmonic adaptability to novel melodic cues is crucial. We further find that quarter-note quantization, interleaving of bar tokens, and pitch-class melody representations are advantageous in the FF setting. Our findings highlight the importance of training curricula in enabling effective melody conditioning and suggest that full-to-full unmasking offers a robust strategy for single-encoder harmonization.