Fine-grained Recognition with Learnable Semantic Data Augmentation

Fine-grained image recognition is a longstanding computer vision challenge that focuses on differentiating objects belonging to multiple subordinate categories within the same meta-category. Since images belonging to the same meta-category usually share similar visual appearances, mining discriminative visual cues is the key to distinguishing fine-grained categories. Although commonly used image-level data augmentation techniques have achieved great success in generic image classification problems, they are rarely applied in fine-grained scenarios, because they edit randomly chosen image regions and are therefore prone to destroying the discriminative visual cues that reside in subtle regions. In this paper, we propose diversifying the training data at the feature level to alleviate this loss of discriminative regions. Specifically, we produce diversified augmented samples by translating image features along semantically meaningful directions. The semantic directions are estimated with a covariance prediction network, which predicts a sample-wise covariance matrix to adapt to the large intra-class variation inherent in fine-grained images. Furthermore, the covariance prediction network is jointly optimized with the classification network in a meta-learning manner to alleviate the degenerate solution problem. Experiments on four competitive fine-grained recognition benchmarks (CUB-200-2011, Stanford Cars, FGVC Aircrafts, NABirds) demonstrate that our method significantly improves the generalization performance of several popular classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets and ViT). Combined with a recently proposed method, our semantic data augmentation approach achieves state-of-the-art performance on the CUB-200-2011 dataset. The source code will be released.
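The feature-level augmentation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `predict_diag_covariance` and `augment_features`, the single linear layer with softplus, the diagonal form of the covariance, and the `strength` parameter are all assumptions made for illustration. The meta-learning procedure that jointly trains the covariance predictor and the classifier is not shown.

```python
import numpy as np


def predict_diag_covariance(features, weight, bias):
    """Hypothetical covariance predictor: one linear layer plus softplus.

    Returns a per-sample diagonal covariance (variances) of shape (N, D).
    The diagonal form is an assumption made to keep the sketch small.
    """
    logits = features @ weight + bias
    # Softplus, log(1 + exp(x)), keeps every predicted variance positive.
    return np.logaddexp(0.0, logits)


def augment_features(features, variances, strength, rng):
    """Translate each feature along a direction sampled from
    N(0, strength * diag(variances)), one covariance per sample."""
    noise = rng.standard_normal(features.shape)
    return features + np.sqrt(strength * variances) * noise


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8))      # 4 samples, 8-dim deep features
weight = 0.1 * rng.standard_normal((8, 8))  # untrained, illustrative parameters
bias = np.zeros(8)

variances = predict_diag_covariance(features, weight, bias)
augmented = augment_features(features, variances, strength=0.5, rng=rng)
print(augmented.shape)
```

Because the variances are predicted per sample rather than shared per class, each feature vector is perturbed along its own set of directions, which is the property the abstract attributes to the sample-wise covariance matrix.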

