Dance-to-Music Generation with Encoder-based Textual Inversion of Diffusion Models

Music that aligns with dance movements is central to conveying the artistic essence of dance, and it also makes games and animation more immersive. Although text-to-music generation has advanced remarkably in producing high-fidelity audio, current methods mainly control global attributes such as genre and emotional tone. They often overlook fine-grained control of temporal rhythm, which is indispensable in music for dance because the musical beats must align with the dancers' movements. To address this gap, we propose an encoder-based textual inversion technique that augments text-to-music models with visual control, enabling personalized music generation. Specifically, we develop dual-path rhythm-genre inversion to integrate the rhythm and genre of a dance motion sequence into the textual space of a text-to-music model. Unlike the classical textual inversion method, which directly updates text embeddings to reconstruct a single target object, our approach uses separate rhythm and genre encoders to obtain text embeddings for two pseudo-words, so it adapts to varying rhythms and genres. For more accurate evaluation, we also propose improved metrics for rhythm alignment. We demonstrate that our approach outperforms state-of-the-art methods across multiple evaluation metrics. Furthermore, our method adapts to in-the-wild data and integrates with the inherent text-guided generation capability of the pre-trained model. Samples are available at \url{https://youtu.be/D7XDwtH1YwE}.
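To make the dual-path idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each encoder can be stood in for by a fixed random linear map, that the two pseudo-words are written as the hypothetical placeholders `<rhythm>` and `<genre>`, and that the toy dimensions, vocabulary, and motion features are purely illustrative. It shows only the mechanism stated in the abstract: two separate encoders turn a motion input into two text embeddings, which take the place of the pseudo-word embeddings in the prompt.

```python
import random

EMBED_DIM = 8    # hypothetical text-embedding width
MOTION_DIM = 6   # hypothetical motion-feature width


def make_linear_encoder(in_dim, out_dim, seed):
    """Return a fixed random linear map standing in for a learned encoder."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def encode(features):
        return [sum(w * x for w, x in zip(row, features)) for row in weights]

    return encode


# Two separate paths: one for rhythm, one for genre (both are stand-ins).
rhythm_encoder = make_linear_encoder(MOTION_DIM, EMBED_DIM, seed=0)
genre_encoder = make_linear_encoder(MOTION_DIM, EMBED_DIM, seed=1)


def build_prompt_embeddings(tokens, vocab_embeddings, motion_features):
    """Look up ordinary tokens; fill pseudo-words from the two encoders."""
    pseudo = {
        "<rhythm>": rhythm_encoder(motion_features),
        "<genre>": genre_encoder(motion_features),
    }
    return [pseudo[t] if t in pseudo else vocab_embeddings[t] for t in tokens]


# Toy usage: a frozen vocabulary of zero vectors and one motion feature vector.
vocab = {w: [0.0] * EMBED_DIM for w in ["a", "song", "with", "and"]}
tokens = ["a", "song", "with", "<rhythm>", "and", "<genre>"]
motion = [0.2, -0.1, 0.5, 0.0, 0.3, -0.4]
embeds = build_prompt_embeddings(tokens, vocab, motion)
other = build_prompt_embeddings(tokens, vocab, [0.9, 0.4, -0.2, 0.1, 0.0, 0.6])
```

Because the pseudo-word embeddings are computed from the motion input rather than optimized once for a single target, a different dance clip yields different embeddings without re-running inversion, which is the contrast with classical textual inversion drawn above.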
