Shadow removal from single images has received increasing attention in recent years, but removing shadows from dynamic scenes remains largely under-explored. In this paper, we propose the first data-driven video shadow removal model, termed PSTNet, which exploits three essential characteristics of video shadows: physical property, spatial relation, and temporal coherence. Specifically, a dedicated physical branch performs local illumination estimation, which is better suited to scenes with complex lighting and textures, and then enhances the physical features via a mask-guided attention strategy. We then develop a progressive aggregation module that enhances the spatial and temporal characteristics of the feature maps and effectively integrates the three kinds of features. Furthermore, to address the lack of paired shadow video datasets, we synthesize a dataset (SVSRD-85) with the aid of the popular game GTAV by toggling its shadow renderer. Experiments against 9 state-of-the-art models, including image shadow removers and image/video restoration methods, show that our method improves on the best of them by 14.7 in terms of RMSE over the shadow area. In addition, we develop a lightweight model adaptation strategy to make our model, which is trained on synthetic data, effective in real-world scenes. A visual comparison on the public SBU-TimeLapse dataset verifies the generalization ability of our model in real scenes.
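To make the reported metric concrete, the following is a minimal sketch of an RMSE restricted to the shadow area, averaged over the frames of a video. It is an illustration under stated assumptions, not the evaluation code behind the results above: the function names `shadow_rmse` and `video_shadow_rmse` are hypothetical, frames are assumed to be float arrays of shape (H, W, C) already in whatever color space the evaluation uses, and the shadow mask is assumed to be a boolean (H, W) array.

```python
import numpy as np


def shadow_rmse(pred, target, mask):
    """RMSE over the pixels selected by a boolean shadow mask (hypothetical helper).

    pred, target: arrays of shape (H, W, C); mask: boolean array of shape (H, W).
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no pixels")
    # Boolean indexing keeps only shadow pixels, giving shape (N, C).
    diff = pred[mask] - target[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def video_shadow_rmse(preds, targets, masks):
    """Mean of the per-frame shadow RMSE (averaging over frames is an assumption)."""
    scores = [shadow_rmse(p, t, m) for p, t, m in zip(preds, targets, masks)]
    return float(np.mean(scores))
```

Errors outside the mask do not contribute to the score, so a method is judged only on how well it restores the shadowed pixels.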