Music captioning has gained significant attention with the growing prominence of streaming media platforms. Traditional approaches often prioritize either the audio or the lyrics of a piece of music, overlooking the interplay between the two. A comprehensive understanding of music, however, requires integrating both elements. In this study, we address this gap by introducing a method that systematically learns multimodal alignment between audio and lyrics through contrastive learning. The method recognizes and emphasizes the synergy between audio and lyrics, and it enables models to achieve deeper cross-modal coherence and thereby produce high-quality captions. We provide both theoretical and empirical results demonstrating the advantage of the proposed method, which achieves new state-of-the-art results on two music captioning datasets.
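The abstract does not state which contrastive objective is used, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE loss over a batch of paired audio and lyric embeddings. The function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_row(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_info_nce(audio_emb, lyric_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss for paired audio/lyric embeddings.

    audio_emb[i] and lyric_emb[i] are assumed to come from the same track;
    every other pairing in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in lyric_emb]
    n = len(a)

    # Cosine similarities scaled by the temperature: logits[i][j] = a_i . t_j / tau
    logits = [[dot(a[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]

    # Audio-to-lyrics and lyrics-to-audio cross-entropy; the target is the diagonal.
    loss_a2t = -sum(log_softmax_row(logits[i])[i] for i in range(n)) / n
    loss_t2a = -sum(log_softmax_row(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (loss_a2t + loss_t2a)


# Toy embeddings (made up): aligned pairs should give a lower loss than swapped pairs.
audio = [[1.0, 0.0], [0.0, 1.0]]
lyrics_aligned = [[1.0, 0.0], [0.0, 1.0]]
lyrics_swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_info_nce(audio, lyrics_aligned)
swapped_loss = symmetric_info_nce(audio, lyrics_swapped)
```

Under this objective, minimizing the loss pulls each audio embedding toward the lyric embedding of the same track and pushes it away from the lyrics of other tracks in the batch, which is one way the cross-modal alignment described above could be learned.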