ROI-Aware Multiscale Cross-Attention Vision Transformer for Pest Image Identification

Pests captured with imaging devices are often small relative to the entire image, and complex backgrounds can have colors and textures similar to those of the pests. Both factors hinder accurate feature extraction and make pest identification challenging. The key to pest identification is a model that can detect regions of interest (ROIs) and refine them for attention and discriminative learning. To address these problems, we study how to generate and update ROIs via multiscale cross-attention fusion and how to achieve robustness to complex backgrounds and scale variation. We propose a novel ROI-aware multiscale cross-attention vision transformer (ROI-ViT). The proposed ROI-ViT has two branches, called the Pest and ROI branches, which take different inputs: pest images and ROI maps, respectively. To produce the ROI maps, ROI generators are built using soft segmentation and a class activation map and are then integrated into the ROI-ViT backbone. In addition, complementary feature fusion and multiscale hierarchies are implemented across the two branches via a novel multiscale cross-attention fusion, in which the class token from the Pest branch is exchanged with the patch tokens from the ROI branch, and vice versa. Experimental results show that the proposed ROI-ViT achieves 81.81%, 99.64%, and 84.66% on the IP102, D0, and SauTeg pest datasets, respectively, outperforming state-of-the-art (SOTA) models such as MViT, PVT, DeiT, Swin-ViT, and EfficientNet. More importantly, on the new and challenging IP102(CBSS) dataset, which contains only pest images with complex backgrounds and small pests, the proposed model maintains high recognition accuracy, whereas the accuracy of the other SOTA models decreases sharply, demonstrating that our model is more robust to complex backgrounds and scale problems.
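The token exchange between the two branches can be illustrated with a minimal sketch. It is not the ROI-ViT implementation, whose details the abstract does not give. It assumes single-head attention, identity query/key/value projections, and equal token dimensions in both branches, whereas a real multiscale model would use learned projections to align the branches. The function names `cross_attend` and `exchange` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cross_attend(cls_token, other_patches):
    """Update one branch's class token by attending over the other
    branch's patch tokens.

    Assumption: identity projections and a single head, for brevity.
    """
    dim = len(cls_token)
    # Scaled dot-product scores: class token (query) vs. patches (keys).
    scores = [dot(cls_token, p) / math.sqrt(dim) for p in other_patches]
    weights = softmax(scores)
    # Weighted sum of the other branch's patch tokens (values).
    attended = [
        sum(w * p[i] for w, p in zip(weights, other_patches))
        for i in range(dim)
    ]
    # Residual connection keeps the original class-token information.
    updated = [c + a for c, a in zip(cls_token, attended)]
    return updated, weights


def exchange(pest_cls, pest_patches, roi_cls, roi_patches):
    """Pest class token attends to ROI patches, and vice versa."""
    new_pest_cls, _ = cross_attend(pest_cls, roi_patches)
    new_roi_cls, _ = cross_attend(roi_cls, pest_patches)
    return new_pest_cls, new_roi_cls
```

In this sketch each class token acts as the query and the other branch's patch tokens act as keys and values, so the Pest branch can pick up ROI evidence and the ROI branch can pick up image context.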
