Automatic image colorization is an inherently ill-posed and ambiguous problem: estimating plausible colors for a grayscale image requires an accurate semantic understanding of the scene. Although recent interaction-based methods have achieved impressive performance, inferring realistic and accurate colors fully automatically remains very difficult. To ease the semantic understanding of grayscale scenes, this paper exploits the corresponding audio, which naturally carries additional semantic information about the same scene. Specifically, a novel audio-infused automatic image colorization (AIAIC) network is proposed, which is trained in three stages. First, color image semantics serve as a bridge: a colorization network is pretrained under the guidance of color image semantics. Second, the natural co-occurrence of audio and video is used to learn the color semantic correlations between audio and visual scenes. Third, the implicit audio semantic representation is fed into the pretrained network to achieve audio-guided colorization. The whole process is trained in a self-supervised manner, without human annotation. In addition, an audiovisual colorization dataset is established for training and testing. Experiments demonstrate that audio guidance effectively improves the performance of automatic colorization, especially for scenes that are difficult to understand from the visual modality alone.
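The data flow across the three stages can be illustrated with a toy sketch. This is not the AIAIC network: every function name below is hypothetical, each neural network is replaced by a trivial stand-in (a mean color offset, a pixel shift, a per-dimension least-squares line), audio is reduced to one made-up scalar feature per clip, and the stage-1 pretraining is collapsed into a fixed `colorize` function. It only shows how color image semantics can act as the bridge between audio and the colorizer.

```python
from statistics import mean
from typing import List, Sequence, Tuple

RGB = Tuple[float, float, float]


def color_semantics(color_image: Sequence[RGB]) -> List[float]:
    """Toy stand-in for the color-image semantic encoder (assumption):
    the mean offset of each channel from the per-pixel gray level."""
    grays = [sum(px) / 3.0 for px in color_image]
    return [mean(px[c] - g for px, g in zip(color_image, grays)) for c in range(3)]


def clip01(value: float) -> float:
    """Clamp a channel value to the [0, 1] range."""
    return min(1.0, max(0.0, value))


def colorize(gray_image: Sequence[float], semantics: Sequence[float]) -> List[RGB]:
    """Stage 1 stand-in: a 'colorization network' guided by a semantic vector.
    Here it simply shifts each gray pixel by the semantic offsets."""
    return [tuple(clip01(g + s) for s in semantics) for g in gray_image]


def fit_audio_to_semantics(
    audio_feats: Sequence[float], targets: Sequence[Sequence[float]]
) -> List[Tuple[float, float]]:
    """Stage 2 stand-in: learn audio -> color semantics from co-occurring
    (audio, frame) pairs only, via one least-squares line per dimension."""
    a_mean = mean(audio_feats)
    var = sum((a - a_mean) ** 2 for a in audio_feats)
    params = []
    for d in range(3):
        ys = [t[d] for t in targets]
        y_mean = mean(ys)
        slope = sum((a - a_mean) * (y - y_mean) for a, y in zip(audio_feats, ys)) / var
        params.append((slope, y_mean - slope * a_mean))
    return params


def audio_semantics(audio_feat: float, params: Sequence[Tuple[float, float]]) -> List[float]:
    """Map an audio feature to the implicit semantic vector learned in stage 2."""
    return [slope * audio_feat + bias for slope, bias in params]


# Two made-up video clips: a bluish "ocean" frame and a greenish "forest" frame,
# each paired with a hypothetical scalar audio feature.
ocean_frame = [(0.1, 0.3, 0.8), (0.2, 0.4, 0.9)]
forest_frame = [(0.2, 0.7, 0.3), (0.1, 0.6, 0.2)]
pairs = [(0.2, ocean_frame), (0.8, forest_frame)]

# Stage 2: self-supervised fit; the targets come from the color frames themselves.
params = fit_audio_to_semantics(
    [audio for audio, _ in pairs],
    [color_semantics(frame) for _, frame in pairs],
)

# Stage 3: colorize a grayscale frame using only its audio feature as guidance.
gray = [sum(px) / 3.0 for px in ocean_frame]
recolored = colorize(gray, audio_semantics(0.2, params))
print(recolored)
```

In this toy setting the ocean-like audio feature steers the gray frame back toward blue; in the actual method each stand-in would be a learned network and the audio representation would be high-dimensional.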