Vision transformers (ViTs) achieve remarkable performance on large datasets, but tend to perform worse than convolutional neural networks (CNNs) when trained from scratch on smaller datasets, possibly because the architecture lacks a local inductive bias. Recent studies have therefore added locality to the architecture and demonstrated that it can help ViTs achieve performance comparable to CNNs in the small-dataset regime. Existing methods, however, are architecture-specific or incur higher computational and memory costs. We therefore propose a module called Local InFormation Enhancer (LIFE) that extracts patch-level local information and incorporates it into the embeddings used in the self-attention block of ViTs. The proposed module is memory- and computation-efficient, and flexible enough to process auxiliary tokens such as the classification and distillation tokens. Empirical results show that adding the LIFE module improves the performance of ViTs on small image classification datasets. We further demonstrate that this effect extends to downstream tasks such as object detection and semantic segmentation. In addition, we introduce a new visualization method, Dense Attention Roll-Out, designed specifically for dense prediction tasks, which generates class-specific attention maps from the attention maps of all tokens.
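For context on the visualization method, the following is a minimal sketch of standard attention rollout, the existing technique whose name Dense Attention Roll-Out echoes. It is not the dense variant proposed here, whose details the abstract does not give. The function names `matmul` and `attention_rollout` are illustrative, and the sketch assumes each layer's attention has already been averaged over heads into a row-stochastic matrix.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def attention_rollout(layer_attentions):
    """Standard attention rollout (not the paper's dense variant).

    layer_attentions: list of n x n row-stochastic matrices, one per layer,
    ordered from the first layer to the last (assumed head-averaged).
    Returns an n x n matrix whose row i estimates how much output token i
    draws on each input token.
    """
    n = len(layer_attentions[0])
    # Start from the identity: before any layer, each token attends to itself.
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layer_attentions:
        # Mix in the identity to account for the residual connection;
        # the 0.5 weights keep every row summing to 1.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0)
                  for j in range(n)] for i in range(n)]
        # Propagate attention through this layer.
        rollout = matmul(mixed, rollout)
    return rollout
```

In this standard form, only the row belonging to the classification token is usually visualized. The abstract indicates that the proposed method instead draws on the attention maps of all tokens, but how those rows are combined into class-specific maps is not specified here.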