Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting could also be boosted by ViTs and present a new efficient and robust ViT-based matting system, named ViTMatte. Our method has two components: (i) a hybrid attention mechanism combined with a convolution neck, which helps ViTs achieve an excellent performance-computation trade-off in matting tasks; and (ii) a detail capture module, consisting only of simple lightweight convolutions, which supplies the fine detail information that matting requires. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViTs on image matting with concise adaptation. It inherits many superior properties of ViTs, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmarks for image matting; our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.
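For readers less familiar with the task, image matting is conventionally posed as estimating a per-pixel opacity alpha in the compositing equation I = alpha * F + (1 - alpha) * B, where F is the foreground and B the background. The sketch below only illustrates that standard formulation on toy RGB values. It is not part of ViTMatte, and the function names `composite` and `composite_image` are hypothetical, chosen for this example.

```python
# Illustrative only: the standard compositing model that matting inverts.
# Function names are hypothetical and do not come from the ViTMatte code.

def composite(alpha, fg, bg):
    """Blend one pixel per channel: I = alpha * F + (1 - alpha) * B."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


def composite_image(alpha_map, fg_img, bg_img):
    """Apply `composite` to every pixel of equally sized nested-list images."""
    return [
        [composite(a, f, b) for a, f, b in zip(a_row, f_row, b_row)]
        for a_row, f_row, b_row in zip(alpha_map, fg_img, bg_img)
    ]


red = (1.0, 0.0, 0.0)   # toy foreground colour
blue = (0.0, 0.0, 1.0)  # toy background colour

# One row, three pixels: pure background, half-transparent edge, pure foreground.
row = composite_image([[0.0, 0.5, 1.0]], [[red, red, red]], [[blue, blue, blue]])
```

A matting model solves the inverse problem: given the composite image I (and typically a trimap hint), it predicts the alpha map. The intermediate alpha values at object boundaries such as hair are where fine detail matters most.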