Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample, thereby losing fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, preserving local detail and global awareness simultaneously. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks, Archaeoscape, URUR, and Gleason, and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/.
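The two-branch exchange described above can be sketched as follows. This is a minimal NumPy illustration only: the single-head attention, the residual updates, the token counts, and all function names (`attend`, `relay_exchange`) are assumptions for exposition, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    """Single-head scaled dot-product attention (illustrative)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ kv

def relay_exchange(local_feats, global_feats, relay):
    """One hypothetical round of cross-branch exchange via relay tokens.

    local_feats:  (N_l, D) tokens from the high-resolution, small-crop branch
    global_feats: (N_g, D) tokens from the low-resolution, large-crop branch
    relay:        (R, D) learnable relay tokens, with R << N_l and R << N_g
    """
    # 1) Relay tokens aggregate context from both branches.
    relay = relay + attend(relay, np.concatenate([local_feats, global_feats]))
    # 2) Each branch reads the aggregated context back from the relays,
    #    propagating global awareness to the local branch and vice versa.
    local_out = local_feats + attend(local_feats, relay)
    global_out = global_feats + attend(global_feats, relay)
    return local_out, global_out, relay

rng = np.random.default_rng(0)
D, R = 16, 4
local = rng.normal(size=(64, D))   # high-resolution crop tokens
glob = rng.normal(size=(32, D))    # low-resolution context tokens
relay = rng.normal(size=(R, D))
l, g, r = relay_exchange(local, glob, relay)
print(l.shape, g.shape, r.shape)  # (64, 16) (32, 16) (4, 16)
```

Because only the R relay tokens mediate the exchange, the added cost is linear in the number of branch tokens rather than quadratic, which is consistent with the small parameter overhead claimed above.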