Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies because the translated lyrics must also align with visual and auditory cues. We introduce the Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this benchmark, we propose the Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, underscoring the value of multimodal, multilingual approaches for lyrics translation.