Recent years have witnessed significant progress in generative models for music, featuring diverse architectures that balance output quality, diversity, speed, and user control. This study explores a user-friendly graphical interface that lets users draw masked regions for inpainting by an Hourglass Diffusion Transformer (HDiT) model trained on MIDI piano-roll images. To encourage note generation in specified areas, masked regions can be "repainted" with extra noise. Because the non-latent HDiT scales linearly with pixel count, generation proceeds efficiently in pixel space, providing intuitive and interpretable controls such as masking throughout the network, and removing the need to operate in compressed latent spaces such as those provided by pretrained autoencoders. We demonstrate that, in addition to inpainting melodies, accompaniment, and continuations, repainting can increase note density, yielding musical structures that closely match user specifications such as rising, falling, or diverging melody and/or accompaniment, even when these lie outside the typical training data distribution. We achieve performance on par with prior results while operating at longer context windows and without an autoencoder, and we support complex mask geometries, expanding the options for machine-assisted composers to control the generated music.
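The composited inpainting and "repainting with extra noise" described above can be illustrated with a minimal NumPy sketch of the general mechanism, assuming a DDPM-style noise schedule. Here `denoise_fn` stands in for the HDiT's reverse step, and all function names, signatures, and parameters are illustrative placeholders rather than the paper's actual implementation:

```python
import numpy as np

def inpaint_step(x_t, x_known, mask, t, alphas_cumprod, denoise_fn, rng):
    """One composited reverse-diffusion step for mask-based inpainting.

    mask == 1 marks pixels to generate; mask == 0 marks pixels to keep.
    `denoise_fn(x, t)` is a placeholder for the model's reverse step.
    """
    a = alphas_cumprod[t]
    # Forward-diffuse the known pixels to the current noise level t.
    x_known_t = (np.sqrt(a) * x_known
                 + np.sqrt(1.0 - a) * rng.standard_normal(x_known.shape))
    # Let the model take one reverse step everywhere.
    x_model_t = denoise_fn(x_t, t)
    # Composite: model output inside the mask, noised known pixels outside.
    return mask * x_model_t + (1.0 - mask) * x_known_t

def repaint(x_t, t, alphas_cumprod, denoise_fn, rng, jumps=2):
    """'Repainting': re-inject noise into the current estimate and denoise
    again, giving the model extra chances to place notes in the mask."""
    beta = 1.0 - alphas_cumprod[t] / alphas_cumprod[t - 1]
    for _ in range(jumps):
        # Add one step's worth of fresh noise, then denoise once more.
        x_t = (np.sqrt(1.0 - beta) * x_t
               + np.sqrt(beta) * rng.standard_normal(x_t.shape))
        x_t = denoise_fn(x_t, t)
    return x_t
```

With a full mask the composite reduces to pure model output, so the known image is ignored; with an empty mask the step simply re-noises the known pixels. Increasing `jumps` corresponds to more aggressive repainting of the masked region.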