AI-generated video has revolutionized short video production, filmmaking, and personalized media, making local video editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges for multimedia forensics. To address this urgent issue, we propose V2A-Mark, which tackles the limitations of current video tampering forensics: poor generalizability, single-purpose design, and a focus on a single modality. By combining the fragility of video-into-video steganography with deep robust watermarking, our method embeds invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance localization accuracy and decoding robustness. In addition, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information in the audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, demonstrating its superiority in localization precision and copyright accuracy, which is crucial for the sustainable development of video editing in the AIGC era.