Transformer-based segmentation methods struggle to run inference efficiently on high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted attention because they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these architectures. Specifically, we design a mixed backbone that combines convolution and RWKV operations, which achieves the best trade-off between accuracy and efficiency. In addition, we design an efficient decoder that uses multiscale tokens to produce high-quality masks. We call our method RWKV-SAM, a simple, effective, and fast baseline for SAM-like models. Moreover, we build a benchmark containing various high-quality segmentation datasets and use it to jointly train a single efficient, high-quality segmentation model. On this benchmark, RWKV-SAM outperforms transformers and other linear attention models in both efficiency and segmentation quality. For example, compared with a transformer model of the same scale, RWKV-SAM achieves more than a 2x speedup and better segmentation performance on various datasets. RWKV-SAM also outperforms recent vision Mamba models on classification and semantic segmentation. Code and models will be publicly available.