Fine-grained image recognition is a longstanding computer vision challenge that focuses on differentiating objects belonging to multiple subordinate categories within the same meta-category. Because images in the same meta-category usually share a similar visual appearance, mining discriminative visual cues is the key to distinguishing fine-grained categories. Although commonly used image-level data augmentation techniques have achieved great success in generic image classification, they are rarely applied in fine-grained scenarios, because they edit randomly chosen regions and are therefore prone to destroying the discriminative visual cues that reside in subtle regions. In this paper, we propose diversifying the training data at the feature level to alleviate this loss of discriminative regions. Specifically, we produce diversified augmented samples by translating image features along semantically meaningful directions. The semantic directions are estimated with a covariance prediction network, which predicts a sample-wise covariance matrix to adapt to the large intra-class variation inherent in fine-grained images. Furthermore, the covariance prediction network is jointly optimized with the classification network in a meta-learning manner to alleviate the problem of degenerate solutions. Experiments on four competitive fine-grained recognition benchmarks (CUB-200-2011, Stanford Cars, FGVC Aircrafts, NABirds) demonstrate that our method significantly improves the generalization performance of several popular classification networks (e.g., ResNets, DenseNets, EfficientNets, RegNets and ViT). Combined with a recently proposed method, our semantic data augmentation approach achieves state-of-the-art performance on the CUB-200-2011 dataset. The source code will be released.
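To make the idea of translating features along sampled directions concrete, the following is a minimal sketch rather than the method's actual implementation. It assumes a diagonal covariance, so each feature dimension has its own per-sample variance, and it takes that variance as a given input instead of predicting it with a network. The function name `augment_feature` and the `strength` parameter are hypothetical, and the meta-learning optimization is omitted entirely.

```python
import math
import random


def augment_feature(feature, variance, strength=0.5, rng=None):
    """Translate a feature vector along a randomly sampled direction.

    Assumption: the covariance is diagonal, so variance[i] is the
    per-sample variance of feature dimension i.
    """
    if len(feature) != len(variance):
        raise ValueError("feature and variance must have the same length")
    if any(v < 0 for v in variance):
        raise ValueError("variances must be non-negative")
    rng = rng or random.Random()
    # Sample delta ~ N(0, strength * diag(variance)) and shift the feature by it.
    return [
        f + math.sqrt(strength * v) * rng.gauss(0.0, 1.0)
        for f, v in zip(feature, variance)
    ]


feature = [0.2, -1.0, 3.5]
variance = [0.0, 0.04, 1.0]  # hypothetical stand-in for a predicted covariance
augmented = augment_feature(feature, variance, rng=random.Random(0))
```

In this sketch, dimensions with larger variance are perturbed more strongly, while a dimension with zero variance is left unchanged, which illustrates how a sample-wise covariance can steer the augmentation away from directions that should stay fixed.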