ItoV: Efficiently Adapting Deep Learning-based Image Watermarking to Video Watermarking

Robust watermarking aims to embed information imperceptibly in a cover image or video in a way that resists various distortions. Recently, deep learning-based approaches to image watermarking have made significant advances in robustness and invisibility. However, few studies have focused on video watermarking with deep neural networks, owing to its high complexity and computational cost. Our paper aims to answer the following research question: can well-designed deep learning-based image watermarking be efficiently adapted to video watermarking? Our answer is positive. First, we revisit the workflow of deep learning-based watermarking methods, which leads to a critical insight: temporal information in the video may be essential for general computer vision tasks, but not for the specific task of video watermarking. Inspired by this insight, we propose a method named ItoV for efficiently adapting deep learning-based Image watermarking to Video watermarking. Specifically, ItoV merges the temporal dimension of the video with the channel dimension so that deep neural networks can treat videos as images. We further explore the effects of different convolutional blocks in video watermarking. We find that spatial convolution is the primary influential component in video watermarking, and that depthwise convolutions significantly reduce computational cost with negligible impact on performance. In addition, we propose a new frame loss that constrains the watermark intensity to be consistent across the frames of each video clip, significantly improving invisibility. Extensive experiments on the Kinetics-600 and Inter4K datasets show that the adapted video watermarking method outperforms state-of-the-art methods, demonstrating the efficacy of ItoV.
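The dimension merge described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the (batch, channels, frames, height, width) layout, the function names `video_to_image` and `image_to_video`, and the clip size are all assumptions made for illustration. It only shows that folding the temporal axis into the channel axis is a lossless reshape, so a 2D image watermarking network could in principle consume the result.

```python
import numpy as np


def video_to_image(video):
    """Fold the temporal axis into the channel axis.

    Assumed layout (not stated in the abstract): (B, C, T, H, W).
    Returns an array of shape (B, C * T, H, W).
    """
    b, c, t, h, w = video.shape
    return video.reshape(b, c * t, h, w)


def image_to_video(image, channels):
    """Inverse of video_to_image: (B, C * T, H, W) -> (B, C, T, H, W)."""
    b, ct, h, w = image.shape
    if ct % channels != 0:
        raise ValueError("merged channel count must be a multiple of `channels`")
    return image.reshape(b, channels, ct // channels, h, w)


# Hypothetical clip: 2 videos, 3 color channels, 8 frames, 32x32 pixels.
clip = np.random.rand(2, 3, 8, 32, 32).astype(np.float32)
merged = video_to_image(clip)                 # shape (2, 24, 32, 32)
restored = image_to_video(merged, channels=3)  # shape (2, 3, 8, 32, 32)
```

With this layout, merged channel `c * T + t` holds color channel `c` of frame `t`, so the original clip can be recovered exactly after the network has processed the merged tensor.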
