Transformer-based models have recently achieved outstanding performance in image matting. However, their application to high-resolution images remains challenging due to the quadratic complexity of global self-attention. To address this issue, we propose MEMatte, a \textbf{m}emory-\textbf{e}fficient \textbf{m}atting framework for processing high-resolution images. MEMatte incorporates a router before each global attention block, directing informative tokens to the global attention while routing the remaining tokens to a Lightweight Token Refinement Module (LTRM). Specifically, the router employs a local-global strategy to predict the routing probability of each token, and the LTRM uses efficient modules to approximate global attention. Additionally, we introduce a Batch-constrained Adaptive Token Routing (BATR) mechanism, which allows each router to dynamically route tokens based on the image content and the stage of the attention block within the network. Furthermore, we construct an ultra high-resolution image matting dataset, UHR-395, comprising 35,500 training images and 1,000 test images with an average resolution of $4872\times6017$. This dataset is created by compositing 395 different alpha mattes across 11 categories onto various backgrounds, all with high-quality manual annotations. Extensive experiments demonstrate that MEMatte outperforms existing methods on both high-resolution and real-world datasets, reducing memory usage by approximately 88% and latency by 50% on the Composition-1K benchmark. Our code is available at https://github.com/linyiheng123/MEMatte.
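The routing scheme described above can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring function, keep ratio, and `cheap_refine` stand-in for the LTRM are all hypothetical; the point is only the control flow of splitting tokens between an expensive global-attention path and a cheap refinement path.

```python
import numpy as np

def route_tokens(scores, keep_ratio=0.25):
    """Hypothetical router: keep the top-k highest-scoring tokens for
    (expensive) global attention; send the rest to lightweight refinement.
    `scores` plays the role of the per-token routing probability."""
    n = scores.shape[0]
    k = max(1, int(n * keep_ratio))
    order = np.argsort(-scores)          # descending by informativeness
    keep_idx = np.sort(order[:k])        # tokens routed to global attention
    skip_idx = np.sort(order[k:])        # tokens routed to the LTRM stand-in
    return keep_idx, skip_idx

def cheap_refine(tokens):
    # Stand-in for the LTRM: a cheap global-context mix, here a blend of
    # each token with the mean token (O(n) instead of O(n^2)).
    return 0.5 * tokens + 0.5 * tokens.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))    # 16 tokens, 8-dim features
scores = rng.random(16)                  # mock routing probabilities

keep_idx, skip_idx = route_tokens(scores, keep_ratio=0.25)
out = tokens.copy()
out[skip_idx] = cheap_refine(tokens[skip_idx])
# out[keep_idx] would instead pass through full global self-attention.
```

With a keep ratio of 0.25, only 4 of the 16 tokens would reach the quadratic attention block, which is where the memory and latency savings in the abstract come from; BATR would make this ratio adaptive per image and per block rather than fixed.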