Traditional vision-language models struggle with contrastive fine-grained taxonomic reasoning, particularly when distinguishing between visually similar species within the same genus or family. We introduce TaxonRL, a reinforcement learning approach using Group Relative Policy Optimization with intermediate rewards that decomposes the reasoning process into hierarchical taxonomic predictions. Our method incentivizes models to explicitly reason about species-level, genus-level, and family-level features before making final classifications. This structured approach is designed not only to boost accuracy but also to yield a transparent, verifiable decision-making process. On the challenging Birds-to-Words dataset, TaxonRL achieves 91.7\% average accuracy, exceeding human performance (77.3\%) while generating interpretable reasoning traces. We demonstrate strong cross-domain generalization, showing substantial gains in primate and marine species verification. Our results establish that enforcing structured, hierarchical reasoning provides a powerful and transferable framework for fine-grained visual discrimination.
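The hierarchical intermediate reward described above can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the level weights, dictionary field names, and example taxa are all assumptions made for demonstration.

```python
# Illustrative sketch of a hierarchical intermediate reward: the model's
# family-, genus-, and species-level predictions each earn partial credit
# before the final classification is scored. Weights and field names are
# hypothetical, not taken from TaxonRL.

LEVEL_WEIGHTS = {"family": 0.2, "genus": 0.3, "species": 0.5}

def hierarchical_reward(pred: dict, gold: dict) -> float:
    """Return a partial-credit reward in [0, 1] across taxonomic levels."""
    reward = 0.0
    for level, weight in LEVEL_WEIGHTS.items():
        if pred.get(level) == gold.get(level):
            reward += weight
    return reward

# Example: family and genus are correct, species is not, so the model
# receives the family + genus credit rather than an all-or-nothing score.
pred = {"family": "Fringillidae", "genus": "Spinus", "species": "Spinus pinus"}
gold = {"family": "Fringillidae", "genus": "Spinus", "species": "Spinus tristis"}
print(hierarchical_reward(pred, gold))
```

Such a shaped reward gives the policy a learning signal even when the final species-level call is wrong, which is the intuition behind decomposing the reasoning into hierarchical predictions.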