Real-Time Image Segmentation via Hybrid Convolutional-Transformer Architecture Search

Image segmentation is one of the most fundamental problems in computer vision and has drawn considerable attention due to its wide range of applications in image understanding and autonomous driving. However, designing effective and efficient segmentation architectures is a labor-intensive process that can require many rounds of trial and error by human experts. In this paper, we address the challenge of efficiently integrating multi-head self-attention into high-resolution representation CNNs by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial because of the high memory cost of maintaining high-resolution features. In contrast, we develop a multi-target, multi-branch supernet method that not only fully exploits the advantages of high-resolution features but also finds the proper locations for placing multi-head self-attention modules. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and is capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models obtained via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of lightweight convolution layers and memory-efficient self-attention layers across branches of different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at \url{https://github.com/MarvinYu1995/HyCTAS}.
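To make the notion of a Pareto frontier over latency and mIoU concrete, the following is a minimal sketch of how non-dominated architectures could be selected from a pool of evaluated candidates. It is not the HyCTAS search algorithm itself: the `Candidate` type, the helper names, and all numeric values are hypothetical and chosen purely for illustration.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical evaluated architecture (illustrative values only)."""
    name: str
    latency_ms: float  # lower is better
    miou: float        # higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if `a` is no worse than `b` on both objectives
    and strictly better on at least one."""
    no_worse = a.latency_ms <= b.latency_ms and a.miou >= b.miou
    strictly_better = a.latency_ms < b.latency_ms or a.miou > b.miou
    return no_worse and strictly_better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Made-up numbers, not results from the paper.
candidates = [
    Candidate("arch_a", 10.0, 70.0),
    Candidate("arch_b", 15.0, 75.0),
    Candidate("arch_c", 20.0, 74.0),  # dominated by arch_b
    Candidate("arch_d", 30.0, 80.0),
    Candidate("arch_e", 12.0, 68.0),  # dominated by arch_a
]

front = pareto_front(candidates)
print([c.name for c in front])  # ['arch_a', 'arch_b', 'arch_d']
```

Each architecture on the resulting front represents a different latency/accuracy trade-off, so a single search that returns the whole front lets a practitioner pick a model for a given latency budget without searching again.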
