Hierarchical Dynamic Image Harmonization

Image harmonization is a critical task in computer vision that aims to adjust the foreground of a composite image so that it is compatible with the background. Recent works mainly focus on global transformations (e.g., normalization and color curve rendering) to achieve visual consistency. However, these models ignore local visual consistency, and their large model sizes limit their deployment on edge devices. In this paper, we propose a hierarchical dynamic network (HDNet) that adapts features from local to global views for better feature transformation in efficient image harmonization. Inspired by the success of various dynamic models, we propose a local dynamic (LD) module and a mask-aware global dynamic (MGD) module. Specifically, LD matches local representations between the foreground and background regions based on semantic similarity, then adaptively adjusts every foreground local representation according to the appearance of its $K$-nearest-neighbor background regions. In this way, LD produces more realistic images at a finer-grained level while also preserving semantic alignment. MGD applies distinct convolutions to the foreground and background, learning the representations of the foreground and background regions as well as their correlations for global harmonization, which facilitates local visual consistency much more efficiently. Experimental results show that HDNet reduces the total number of model parameters by more than 80\% compared with previous methods while still achieving state-of-the-art performance on the widely used iHarmony4 dataset. Notably, HDNet achieves a 4\% improvement in PSNR and a 19\% reduction in MSE over the prior state-of-the-art methods.
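The LD step described above can be illustrated with a small sketch. It assumes each local region is already encoded as a feature vector, matches foreground regions to background regions by cosine similarity, and substitutes a hypothetical update rule for the paper's learned adjustment: each foreground feature is blended toward the softmax-weighted mean of its $K$ nearest background features. The names `cosine` and `local_dynamic` and the `alpha` and `temperature` parameters are illustrative assumptions, not HDNet's actual implementation.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


def local_dynamic(
    features: List[Vector],
    is_fg: List[bool],
    k: int = 2,
    alpha: float = 0.5,
    temperature: float = 0.1,
) -> List[Vector]:
    """Hypothetical LD-style update on a flat list of local (patch) features.

    Assumption: this is a simplified stand-in, not HDNet's learned module.
    Each foreground patch is matched to its k most similar background patches
    by cosine similarity and moved toward their softmax-weighted mean;
    background patches pass through unchanged. `alpha` and `temperature`
    are illustrative knobs, not values from the paper.
    """
    background = [f for f, fg in zip(features, is_fg) if not fg]
    out: List[Vector] = []
    for f, fg in zip(features, is_fg):
        if not fg or not background:
            # Background patch, or no background to match against: keep as-is.
            out.append(list(f))
            continue
        # K-nearest background neighbours by semantic (cosine) similarity.
        scored = sorted(
            ((cosine(f, b), b) for b in background),
            key=lambda t: t[0],
            reverse=True,
        )[:k]
        # Softmax over the K similarities (max-subtracted for numerical stability).
        top = scored[0][0]
        exps = [math.exp((s - top) / temperature) for s, _ in scored]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Similarity-weighted appearance of the matched background regions.
        target = [
            sum(w * b[c] for w, (_, b) in zip(weights, scored))
            for c in range(len(f))
        ]
        # Adaptive adjustment: blend the foreground feature toward that target.
        out.append([(1.0 - alpha) * x + alpha * t for x, t in zip(f, target)])
    return out
```

For example, `local_dynamic([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]], [False, False, False, True], k=1, alpha=1.0)` returns `[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.9, 0.1]]`: the single foreground feature is replaced by its closest background match, while background features are untouched. Restricting the update to the $K$ nearest neighbors, rather than averaging over all background regions, is what keeps the adjustment semantically aligned in this sketch.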

翻译：图像和谐化是计算机视觉中的一项关键任务，旨在调整前景以使其与背景协调一致。近期研究主要集中于使用全局变换（即归一化和色彩曲线渲染）来实现视觉一致性。然而，这些模型忽略了局部视觉一致性，且其庞大的模型尺寸限制了其在边缘设备上的和谐化能力。本文提出了一种层级动态网络（HDNet），从局部到全局视角自适应调整特征，以实现高效的图像和谐化中的特征变换。受各种动态模型成功应用的启发，本文提出了局部动态（LD）模块和掩码感知全局动态（MGD）模块。具体而言，LD基于语义相似性匹配前景与背景区域的局部表示，然后根据其$K$-近邻背景区域的外观自适应调整每个前景局部表示。通过这种方式，LD能够在更精细的粒度上生成更逼真的图像，同时具备语义对齐的特性。MGD将不同的卷积操作有效地应用于前景和背景，学习前景和背景区域的表示及其与全局和谐化的相关性，从而更高效地促进图像的局部视觉一致性。实验结果表明，与先前方法相比，所提出的HDNet在总模型参数上减少了超过80%，同时在流行的iHarmony4数据集上仍达到最先进的性能。值得注意的是，相较于先前的最先进方法，HDNet在PSNR上提升了4%，并在MSE上降低了19%。