Fine-Grained Visual Recognition (FGVR) tackles the problem of distinguishing highly similar categories. One of the main approaches to FGVR, subset learning, leverages information from existing class taxonomies to improve the performance of deep neural networks. However, these methods rely on handcrafted hierarchies that are not necessarily optimal for the models. In this paper, we propose ELFIS, an expert learning framework for FGVR that clusters the categories of a dataset into meta-categories using both dataset-inherent lexical information and model-specific information. A set of neural-network-based experts is trained on the meta-categories and integrated into a multi-task framework. Extensive experiments show accuracy improvements of up to +1.3% on SoTA FGVR benchmarks using both CNNs and transformer-based networks. Overall, the results indicate that ELFIS can be applied on top of any classification model to obtain SoTA results. The source code will be made public soon.
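To make the idea of clustering categories into meta-categories from lexical and model-specific signals more concrete, the following is a minimal sketch and not the ELFIS implementation, whose details the abstract does not give. It assumes that the lexical signal is a word-overlap (Jaccard) score between class names and that the model-specific signal is a confusion matrix from some baseline classifier. The function names, the blending weight `alpha`, the `threshold`, and the toy data are all hypothetical choices made for illustration.

```python
from itertools import combinations


def lexical_similarity(name_a, name_b):
    """Jaccard overlap between the word sets of two class names (assumed lexical signal)."""
    a, b = set(name_a.lower().split()), set(name_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def confusion_similarity(confusion, i, j):
    """Symmetric rate at which a model confuses classes i and j (assumed model signal)."""
    total = sum(confusion[i]) + sum(confusion[j])
    return (confusion[i][j] + confusion[j][i]) / total if total else 0.0


def cluster_meta_categories(names, confusion, alpha=0.5, threshold=0.2):
    """Merge classes whose blended similarity reaches `threshold`.

    Uses union-find, so the meta-categories are the connected components
    of the graph linking sufficiently similar classes.
    """
    parent = list(range(len(names)))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(names)), 2):
        sim = (alpha * lexical_similarity(names[i], names[j])
               + (1 - alpha) * confusion_similarity(confusion, i, j))
        if sim >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for idx, name in enumerate(names):
        groups.setdefault(find(idx), []).append(name)
    return sorted(groups.values())


# Toy data (hypothetical): rows are true classes, columns are predicted classes.
names = ["Herring Gull", "Ring-billed Gull", "House Sparrow", "Song Sparrow", "Bald Eagle"]
confusion = [
    [8, 2, 0, 0, 0],
    [3, 7, 0, 0, 0],
    [0, 0, 9, 1, 0],
    [0, 0, 2, 8, 0],
    [0, 0, 0, 0, 10],
]

meta_categories = cluster_meta_categories(names, confusion)
print(meta_categories)
```

On this toy data the two gulls and the two sparrows each form a meta-category, while the eagle stays alone. In a setup like the one the abstract describes, each such group would then be assigned to an expert network; how the experts are trained and combined in the multi-task framework is outside the scope of this sketch.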