AI-generated video has revolutionized short video production, filmmaking, and personalized media, making local video editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges for multimedia forensics. To address this urgent issue, we propose V2A-Mark, which tackles the limitations of current video tampering forensics, such as poor generalizability, singular function, and single-modality focus. By combining the fragility of video-into-video steganography with deep robust watermarking, our method embeds invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance localization accuracy and decoding robustness. In addition, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information in the audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, demonstrating its superiority in localization precision and copyright accuracy, which is crucial for the sustainable development of video editing in the AIGC era.