Vision graph neural networks (ViGs) have demonstrated promise in vision tasks as a competitive alternative to conventional convolutional neural networks (CNNs) and vision transformers (ViTs); however, common graph construction methods, such as k-nearest neighbor (KNN), can be expensive on larger images. While methods such as Sparse Vision Graph Attention (SVGA) have shown promise, SVGA's fixed step scale can lead to over-squashing and to requiring multiple intermediate connections to propagate information that a single long-range link could provide. Motivated by this observation, we propose a new graph construction method, Logarithmic Scalable Graph Construction (LSGC), which enhances performance by limiting the number of long-range links. Building on LSGC, we propose LogViG, a novel hybrid CNN-GNN model. Furthermore, inspired by the successes of multi-scale and high-resolution architectures, we introduce a high-resolution branch and fuse features between the high-resolution and low-resolution branches, yielding a multi-scale, high-resolution Vision GNN. Extensive experiments show that LogViG outperforms existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification and semantic segmentation tasks. Our smallest model, Ti-LogViG, achieves an average top-1 accuracy on ImageNet-1K of 79.9% with a standard deviation of 0.2%, a 1.7% higher average accuracy than Vision GNN with a 24.3% reduction in parameters and a 35.3% reduction in GMACs. Our work shows that leveraging long-range links in graph construction for ViGs through our proposed LSGC can exceed the performance of current state-of-the-art ViGs. Code is available at https://github.com/mmunir127/LogViG-Official.
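To illustrate the scaling contrast the abstract draws, the sketch below compares a fixed-step neighbor pattern (SVGA-like) with a logarithmically spaced one (LSGC-like) along one axis of a token grid. This is a minimal, hypothetical illustration of the link-count argument only; the function names and offset scheme are assumptions, and the actual LogViG construction is defined in the paper, not here.

```python
# Hypothetical sketch: number of long-range links per token along one axis.
# A fixed-step pattern grows linearly with grid size n; a logarithmically
# spaced pattern (as LSGC's name suggests) grows only as O(log n).

def fixed_step_offsets(n: int, step: int) -> list[int]:
    """SVGA-like: connect to every `step`-th token; link count ~ n / step."""
    return list(range(step, n, step))

def log_offsets(n: int) -> list[int]:
    """LSGC-like (assumed): connect at power-of-two distances; ~log2(n) links."""
    offsets, d = [], 1
    while d < n:
        offsets.append(d)
        d *= 2
    return offsets

print(fixed_step_offsets(16, 2))  # [2, 4, 6, 8, 10, 12, 14] -> 7 links
print(log_offsets(16))            # [1, 2, 4, 8] -> 4 links
```

Under this assumed scheme, doubling the image side length adds only one extra long-range link per token instead of doubling the link count, which is the efficiency motivation the abstract attributes to limiting long-range links.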