In automatic music generation, a central challenge is to design controls that enable meaningful human-machine interaction. Existing systems often rely on extrinsic inputs such as text prompts or metadata, which do not allow humans to directly shape the composition. While prior work has explored intrinsic controls such as chords or hierarchical structure, these approaches mainly address piano or vocal-accompaniment settings, leaving multi-track symbolic music largely underexplored. We identify instrumentation, the choice of instruments and their roles, as a natural dimension of control in multi-track composition, and propose ViTex, a visual representation of instrumental texture. In ViTex, color encodes instrument choice, spatial position represents pitch and time, and stroke properties capture local textures. Building on this representation, we develop a discrete diffusion model conditioned on ViTex and chord progressions to generate 8-measure multi-track symbolic music, enabling explicit texture-level control while maintaining strong unconditional generation quality. The demo page and code are available at https://vitex2025.github.io/.
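To make the encoding described above concrete, the following is a minimal sketch of how notes might be painted onto a color canvas, with color standing for the instrument and position standing for pitch and time. It is an illustration under stated assumptions, not the ViTex implementation: the palette, the `render_texture` function, the note tuple layout, and the grid of 128 time steps (8 measures at an assumed 16 steps per measure) are all hypothetical, and stroke properties are not modeled.

```python
import numpy as np

# Hypothetical instrument-to-color palette (an assumption, not from the paper).
PALETTE = {
    "piano": (255, 0, 0),
    "bass": (0, 0, 255),
    "strings": (0, 255, 0),
}


def render_texture(notes, n_steps=128, n_pitches=128):
    """Paint (instrument, pitch, start, duration) notes onto an RGB canvas.

    Rows index pitch (highest pitch at the top) and columns index time steps.
    Where notes overlap, later notes overwrite earlier ones.
    """
    canvas = np.zeros((n_pitches, n_steps, 3), dtype=np.uint8)
    for instrument, pitch, start, duration in notes:
        row = n_pitches - 1 - pitch            # flip so high pitches sit at the top
        end = min(start + duration, n_steps)   # clip notes that run past the canvas
        canvas[row, start:end] = PALETTE[instrument]
    return canvas


# Assumed toy input: MIDI pitch numbers, start and duration in time steps.
notes = [("piano", 60, 0, 4), ("bass", 36, 0, 8), ("strings", 72, 4, 4)]
image = render_texture(notes)
```

A canvas like this could then serve as a conditioning input alongside a chord progression; how the paper's model actually consumes the representation is not specified in this abstract.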