Distinguishing among marine benthic habitat characteristics is of key importance in a wide range of seabed operations, from installing oil rigs and laying cable networks to monitoring human impact on marine ecosystems. The Side-Scan Sonar (SSS) is a widely used imaging sensor for this purpose. It produces high-resolution seafloor maps by logging the intensities of sound waves reflected back from the seafloor. In this work, we leverage these acoustic intensity maps to produce a pixel-wise categorization of different seafloor types. We propose a novel architecture adapted from the Vision Transformer (ViT) in an encoder-decoder framework. In doing so, we also evaluate the applicability of ViTs to smaller datasets. To overcome the lack of CNN-like inductive biases, and thereby make ViTs more conducive to applications in low-data regimes, we propose a novel feature extraction module to replace the Multi-layer Perceptron (MLP) block within transformer layers and a novel module to extract multiscale patch embeddings. A lightweight decoder is also proposed to complement this design and further boost multiscale feature extraction. With the modified architecture, we achieve state-of-the-art results while also meeting real-time computational requirements. We make our code available at~\url{https://github.com/hayatrajani/s3seg-vit}.