We present a lightweight model for high-resolution portrait matting. The model does not use any auxiliary inputs such as trimaps or background captures, and it achieves real-time performance on HD videos and near-real-time performance on 4K videos. Our model is built upon a two-stage framework: a low-resolution network for coarse alpha estimation, followed by a refinement network for local region improvement. However, a naive implementation of the two-stage model suffers from poor matting quality when no auxiliary inputs are used. We address this performance gap by using the vision transformer (ViT) as the backbone of the low-resolution network, motivated by the observation that the tokenization step of ViT can reduce spatial resolution while retaining as much pixel information as possible. To inform local regions of the context, we propose a novel cross region attention (CRA) module in the refinement network that propagates contextual information across neighboring regions. We demonstrate that our method achieves superior results and outperforms other baselines on three benchmark datasets while using only $1/20$ of the FLOPs of the existing state-of-the-art model.
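The observation about ViT tokenization can be illustrated with a minimal sketch. It covers only the patch-splitting step, which rearranges an image into a coarser grid of tokens without discarding any pixel values. It is not the authors' implementation: the function names `patchify` and `unpatchify`, the patch size of 16, and the 64×96 test frame are assumptions chosen for illustration. In an actual ViT, a learned linear projection follows this step, and whether that projection preserves all information depends on the embedding dimension.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into a (H/patch, W/patch, patch*patch*C) token grid."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (gh, gw, patch, patch, c)
    return x.reshape(gh, gw, patch * patch * c)


def unpatchify(tokens, patch, channels):
    """Invert patchify, recovering the original (H, W, C) image."""
    gh, gw, _ = tokens.shape
    x = tokens.reshape(gh, gw, patch, patch, channels)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (gh, patch, gw, patch, c)
    return x.reshape(gh * patch, gw * patch, channels)


# Hypothetical input: a small random "frame" standing in for a video frame.
rng = np.random.default_rng(0)
frame = rng.random((64, 96, 3))

tokens = patchify(frame, patch=16)
restored = unpatchify(tokens, patch=16, channels=3)
```

Here the spatial grid shrinks by a factor of 16 along each axis while each token carries all 16 × 16 × 3 values of its patch, so the rearrangement is exactly invertible. This contrasts with strided or pooled downsampling, which drops or averages pixels before the low-resolution network sees them.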