Music generation has progressed significantly, especially in the domain of audio generation. However, generating symbolic music that is both long-structured and expressive remains a major challenge. In this paper, we propose PerceiverS (Segmentation and Scale), a novel architecture designed to address this issue by leveraging both Effective Segmentation and Multi-Scale attention mechanisms. Our approach enhances symbolic music generation by simultaneously learning long-term structural dependencies and short-term expressive details. By combining cross-attention and self-attention in a Multi-Scale setting, PerceiverS captures long-range musical structure while preserving performance nuances. Evaluated on datasets such as Maestro, the proposed model demonstrates improvements in generating coherent and diverse music with both structural consistency and expressive variation. The project demos and the generated music samples can be accessed through the link: https://perceivers.github.io.