This paper explores a hierarchical prompting mechanism for the hierarchical image classification (HIC) task. Unlike prior HIC methods, our hierarchical prompting is the first to explicitly inject ancestor-class information as a tokenized hint that benefits descendant-class discrimination. We believe this closely imitates human visual recognition: humans may use the ancestor class as a prompt to focus on the subtle differences among descendant classes. We implement this prompting mechanism as a Transformer with Hierarchical Prompting (TransHP). TransHP consists of three steps: 1) learning a set of prompt tokens to represent the coarse (ancestor) classes, 2) predicting the coarse class of the input image on the fly at an intermediate block, and 3) injecting the prompt token of the predicted coarse class into the intermediate feature. Although the parameters of TransHP remain the same for all input images, the injected coarse-class prompt conditions (modifies) the subsequent feature extraction and encourages a dynamic focus on the relatively subtle differences among the descendant classes. Extensive experiments show that TransHP improves image classification in terms of accuracy (e.g., improving the ImageNet classification accuracy of ViT-B/16 by +2.83%), training data efficiency (e.g., a +12.69% improvement when using 10% of the ImageNet training data), and model explainability. Moreover, TransHP compares favorably with prior HIC methods, indicating that it exploits the hierarchical information effectively.
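The three steps above can be illustrated with a minimal, framework-free sketch. It is not the TransHP implementation: the abstract does not specify how the coarse class is predicted or how the prompt is injected, so the sketch assumes (hypothetically) that the coarse class is scored by the dot product between the mean-pooled intermediate tokens and each prompt token, and that injection appends the selected prompt token to the token sequence. The names `make_prompt_pool`, `predict_coarse`, and `inject_prompt` are illustrative, and the randomly initialised pool only stands in for prompt tokens that would be learned during training.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_prompt_pool(num_coarse, dim, seed=0):
    """Step 1 (stand-in): one prompt token per coarse class.

    Assumption: random values replace parameters that would be learned.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_coarse)]


def predict_coarse(tokens, prompt_pool):
    """Step 2: predict the coarse class from intermediate tokens.

    Assumption: similarity between the pooled feature and each prompt token.
    """
    pooled = mean_pool(tokens)
    probs = softmax([dot(pooled, p) for p in prompt_pool])
    coarse_id = max(range(len(probs)), key=probs.__getitem__)
    return coarse_id, probs


def inject_prompt(tokens, prompt_pool):
    """Step 3: add the predicted class's prompt token to the sequence.

    Assumption: injection is done by appending one extra token.
    """
    coarse_id, _ = predict_coarse(tokens, prompt_pool)
    return tokens + [list(prompt_pool[coarse_id])], coarse_id


# Hand-set example with two coarse classes and 2-dimensional tokens.
pool = [[1.0, 0.0], [0.0, 1.0]]
intermediate_tokens = [[0.9, 0.1], [0.8, 0.0]]
conditioned_tokens, coarse_id = inject_prompt(intermediate_tokens, pool)
# coarse_id is 0 here, and conditioned_tokens has one more token than the input.
```

In this sketch the weights are shared across all inputs, yet the token sequence seen by the later blocks differs per image because the appended prompt depends on the predicted coarse class. That is the sense in which the prompt conditions the subsequent feature extraction.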