Generative AI systems for music and video commonly use text-based filters to prevent the regurgitation of copyrighted material. We expose a fundamental flaw in this approach by introducing Adversarial PhoneTic Prompting (APT), a novel attack that bypasses these safeguards by exploiting phonetic memorization. The APT attack replaces iconic lyrics with homophonic but semantically unrelated alternatives (e.g., "mom's spaghetti" becomes "Bob's confetti"), preserving acoustic structure while altering meaning; we identify high-fidelity phonetic matches using the CMU Pronouncing Dictionary. We demonstrate that leading Lyrics-to-Song (L2S) models like SUNO and YuE regenerate songs with striking melodic and rhythmic similarity to their copyrighted originals when prompted with these altered lyrics. More surprisingly, this vulnerability extends across modalities. When prompted with phonetically modified lyrics from a song, a Text-to-Video (T2V) model like Veo 3 reconstructs visual scenes from the original music video, including specific settings and character archetypes, despite the absence of any visual cues in the prompt. Our findings reveal that models memorize deep, structural patterns tied to acoustics, not just verbatim text. This phonetic-to-visual leakage represents a critical vulnerability in transcript-conditioned generative models, rendering simple copyright filters ineffective and raising urgent concerns about the secure deployment of multimodal AI systems. Demo examples are available at our project page (https://jrohsc.github.io/music_attack/).
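To illustrate the phonetic-matching step, the sketch below scores candidate substitutions by edit distance over ARPAbet phoneme sequences. The tiny hand-transcribed dictionary is an illustrative stand-in for the full CMU Pronouncing Dictionary, and the scoring function is a minimal sketch, not the paper's exact matching procedure.

```python
# Minimal sketch of homophone scoring via CMU-style phoneme transcriptions.
# The entries below are illustrative ARPAbet transcriptions; the actual attack
# would query the full CMU Pronouncing Dictionary.
CMU_LIKE = {
    "mom's":     ["M", "AA1", "M", "Z"],
    "bob's":     ["B", "AA1", "B", "Z"],
    "spaghetti": ["S", "P", "AH0", "G", "EH1", "T", "IY0"],
    "confetti":  ["K", "AH0", "N", "F", "EH1", "T", "IY0"],
}

def phoneme_edit_distance(a, b):
    """Levenshtein distance over two phoneme sequences (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete pa
                                     dp[j - 1] + 1,    # insert pb
                                     prev + (pa != pb))  # substitute
    return dp[-1]

def phrase_distance(phrase_a, phrase_b):
    """Sum of per-word phoneme distances (assumes equal word counts)."""
    return sum(
        phoneme_edit_distance(CMU_LIKE[wa], CMU_LIKE[wb])
        for wa, wb in zip(phrase_a.lower().split(), phrase_b.lower().split())
    )

# A low phrase_distance flags a phrase as a high-fidelity phonetic match:
# "mom's spaghetti" vs. "Bob's confetti" differ in only a few phonemes
# while sharing stress pattern and rhyme, so the acoustic shape survives.
score = phrase_distance("mom's spaghetti", "bob's confetti")
```

In practice one would rank many candidate phrases by such a score (normalized by phoneme count) and keep only substitutions below a similarity threshold, ensuring the altered lyrics remain acoustically close to the originals while evading exact-match text filters.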