Hierarchical image classification predicts labels across a semantic taxonomy, but existing methods typically assume complete, fine-grained annotations, an assumption rarely met in practice. Real-world supervision varies in granularity, influenced by image quality, annotator expertise, and task demands; a distant bird may be labeled "Bird," while a close-up reveals "Bald eagle." We introduce ImageNet-F, a large-scale benchmark curated from ImageNet and structured into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic, mixed-granularity labels that reflect human annotation behavior. We propose free-grain learning, a setting in which supervision granularity varies across instances. We develop methods that enhance semantic guidance via pseudo-attributes from vision-language models and visual guidance via semi-supervised learning. These methods, along with strong baselines, substantially improve performance under mixed supervision. Together, our benchmark and methods advance hierarchical classification under real-world constraints.