Retrieval-augmented models are becoming increasingly popular for computer vision tasks following their recent success in NLP. The goal is to enhance the recognition capabilities of a model by retrieving examples similar to the visual input from an external memory set. In this work, we introduce an attention-based memory module that learns the importance of each example retrieved from the memory. Compared to existing approaches, our method removes the influence of irrelevant retrieved examples and retains those that are beneficial to the input query. We also thoroughly study various ways of constructing the memory dataset. Our experiments show the benefit of using a massive-scale memory dataset of 1B image-text pairs and demonstrate the performance of different memory representations. We evaluate our method on three different classification tasks, namely long-tailed recognition, learning with noisy labels, and fine-grained classification, and show that it achieves state-of-the-art accuracy on the ImageNet-LT, Places-LT, and Webvision datasets.
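The idea of weighting retrieved examples by learned attention can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `retrieve_top_k` and `attend_over_memory`, the cosine-similarity retrieval, the single-head scaled dot-product attention, the residual combination, and the projection matrices `w_q` and `w_k` (which would be learned in practice) are all assumptions made for illustration.

```python
import numpy as np


def retrieve_top_k(query, memory_keys, k):
    """Return indices of the k memory entries most similar to the query.

    Cosine similarity is an assumed choice of retrieval metric.
    """
    q = query / np.linalg.norm(query)
    m = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    sims = m @ q                      # (num_memory,)
    return np.argsort(-sims)[:k]      # highest similarity first


def attend_over_memory(query, keys, values, w_q, w_k):
    """Single-head scaled dot-product attention of one query over k retrieved items.

    query:  (d,)        embedding of the input image
    keys:   (k, d)      embeddings used to score the retrieved items
    values: (k, d)      embeddings that get mixed into the output
    w_q, w_k: (d, d_a)  hypothetical projection matrices (learned in practice)
    Returns the refined embedding (d,) and the attention weights (k,).
    """
    q = query @ w_q                           # (d_a,)
    k = keys @ w_k                            # (k, d_a)
    scores = k @ q / np.sqrt(q.shape[0])      # (k,)
    scores = scores - scores.max()            # numerical stability for softmax
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ values                # weighted sum of retrieved values
    return query + context, weights           # residual combination (assumed)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory = rng.normal(size=(1000, 32))      # toy stand-in for a memory set
    query = rng.normal(size=32)
    idx = retrieve_top_k(query, memory, k=8)
    w_q = rng.normal(size=(32, 16))
    w_k = rng.normal(size=(32, 16))
    refined, weights = attend_over_memory(query, memory[idx], memory[idx], w_q, w_k)
    print("attention weights:", np.round(weights, 3))
```

In a sketch like this, a retrieved example that receives a near-zero weight contributes almost nothing to the refined embedding, which is one way a model could downweight irrelevant retrievals while keeping helpful ones.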