As Joint Audio-Visual Generation Models see widespread commercial deployment, embedding watermarks has become essential for protecting vendor copyright and ensuring content provenance. However, existing techniques suffer from an architectural mismatch: they treat the modalities as decoupled entities, which exposes a critical Binding Vulnerability. Adversaries exploit this through Swap Attacks, replacing the authentic audio with a malicious deepfake while retaining the watermarked video. Because current detectors verify each modality independently and accept when either watermark is found ($\mathrm{Video}_{wm} \lor \mathrm{Audio}_{wm}$), they authenticate the manipulated content, falsely attributing harmful media to the original vendor and severely damaging its reputation. To address this, we propose mAVE (Manifold Audio-Visual Entanglement), the first watermarking framework natively designed for joint architectures. mAVE cryptographically binds audio and video latents at initialization, without fine-tuning, and defines a Legitimate Entanglement Manifold via Inverse Transform Sampling. Experiments on state-of-the-art models (LTX-2, MOVA) show that mAVE is performance-lossless and provides an exponential security bound against Swap Attacks. With near-perfect binding integrity ($>99\%$), mAVE offers a robust cryptographic defense for vendor copyright.
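To make the Binding Vulnerability concrete, the following is a minimal sketch, not mAVE's actual construction. It assumes a vendor key, a per-generation `nonce` known to the verifier, HMAC-SHA256 as a pseudo-random function, and sign-based bit recovery directly from the initial latents (a real detector would first have to invert the generator's sampling process, which is skipped here). All names (`prf_bits`, `generate_pair`, `verify_decoupled`, `verify_bound`, `N_BITS`, `tau`) are hypothetical. Keyed bits are mapped to Gaussian latent values by inverse transform sampling, and the audio bits are derived from the video bits so that the two modalities are only valid together.

```python
import hashlib
import hmac
import random
from statistics import NormalDist

N_BITS = 256                  # hypothetical number of latent entries per modality
_STD_NORMAL = NormalDist()    # standard Gaussian, used for its inverse CDF


def prf_bits(key: bytes, message: bytes, n: int = N_BITS) -> list:
    """Expand HMAC-SHA256(key, message || counter) into n pseudo-random bits."""
    bits, counter = [], 0
    while len(bits) < n:
        block = hmac.new(key, message + counter.to_bytes(4, "big"),
                         hashlib.sha256).digest()
        for byte in block:
            bits.extend((byte >> i) & 1 for i in range(8))
        counter += 1
    return bits[:n]


def bits_to_latent(bits, rng):
    """Inverse transform sampling: bit b picks the half [b/2, (b+1)/2) of the
    unit interval; the Gaussian inverse CDF maps the draw to a latent value."""
    latent = []
    for b in bits:
        u = (b + rng.random()) / 2.0
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # inv_cdf requires 0 < u < 1
        latent.append(_STD_NORMAL.inv_cdf(u))
    return latent


def latent_to_bits(latent):
    """Recover one bit per entry from the sign of the latent value."""
    return [1 if z >= 0.0 else 0 for z in latent]


def match_rate(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)


def expected_bits(key, nonce):
    """Video bits depend on (key, nonce); audio bits depend on the video bits."""
    video_bits = prf_bits(key, b"video|" + nonce)
    audio_bits = prf_bits(key, b"audio|" + bytes(video_bits))
    return video_bits, audio_bits


def generate_pair(key, nonce, rng):
    """Sample initial video and audio latents carrying the entangled bits."""
    video_bits, audio_bits = expected_bits(key, nonce)
    return bits_to_latent(video_bits, rng), bits_to_latent(audio_bits, rng)


def _check(key, nonce, video, audio, tau):
    video_bits, audio_bits = expected_bits(key, nonce)
    video_ok = match_rate(latent_to_bits(video), video_bits) >= tau
    audio_ok = match_rate(latent_to_bits(audio), audio_bits) >= tau
    return video_ok, audio_ok


def verify_decoupled(key, nonce, video, audio, tau=0.9):
    """Independent verification: accept if EITHER modality matches (OR)."""
    video_ok, audio_ok = _check(key, nonce, video, audio, tau)
    return video_ok or audio_ok


def verify_bound(key, nonce, video, audio, tau=0.9):
    """Bound verification: accept only if both match the SAME generation."""
    video_ok, audio_ok = _check(key, nonce, video, audio, tau)
    return video_ok and audio_ok


if __name__ == "__main__":
    rng = random.Random(0)
    key = b"vendor-secret"  # hypothetical vendor key
    video, audio = generate_pair(key, b"gen-001", rng)
    swapped = [rng.gauss(0.0, 1.0) for _ in range(N_BITS)]  # unwatermarked audio
    print("authentic, decoupled:", verify_decoupled(key, b"gen-001", video, audio))
    print("authentic, bound:    ", verify_bound(key, b"gen-001", video, audio))
    print("swapped, decoupled:  ", verify_decoupled(key, b"gen-001", video, swapped))
    print("swapped, bound:      ", verify_bound(key, b"gen-001", video, swapped))
```

In this sketch the OR-style detector accepts a swapped pair because the video alone still verifies, whereas the bound detector rejects it. Audio taken from a different watermarked generation by the same vendor is rejected as well, since its bits were derived from another video pattern. For unrelated audio, each recovered bit matches with probability 1/2, so by Hoeffding's inequality the chance of reaching a match rate of `tau = 0.9` is at most $\exp(-2N(0.4)^2)$ for $N$ latent entries. That bound belongs to this toy setup and is not the bound proved for mAVE.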