Coarse-to-fine path decision-making requires predicting a valid taxonomy path in which earlier decisions constrain later ones. However, existing benchmarks score each level independently, obscuring cross-level validity and consistency. To better align evaluation with this setting, we introduce a Joint Path Decision (JPD) protocol that requires predicting the full path in one call, together with Depth-Weighted Prefix Accuracy (DWPA), a metric family that measures path reliability with tunable emphasis on deeper levels. Under JPD, strong vision-language models (VLMs) frequently produce invalid parent-child pairs and brittle full-path predictions, suggesting that their failures stem not only from incomplete taxonomic knowledge but also from unstable cross-level decision coordination. To address this problem, we propose DuoTeach, a dual-role self-teaching distillation framework that requires no ground-truth labels and reuses the same pretrained VLM in two roles, teacher and student. Its Decision-Conditioned Rollout (DCR) generates more coherent teacher traces by conditioning each level on prior decisions; DuoTeach then distills this coordinated behavior into the student, so no additional test-time rollouts are needed. Across multiple taxonomy-structured benchmarks and VLM base models, DuoTeach improves in-domain DWPA (α = 0.95) by up to 30.24 points and boosts zero-shot performance on unseen taxonomies from 17.17% to 43.66%. Further analyses attribute these gains to improved within-call multi-level decision coordination.
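To make the evaluation idea concrete, the sketch below shows one way a depth-weighted prefix score and a parent-child validity check could be computed for a single predicted path. The abstract does not give the DWPA formula, so the weighting scheme here (level d weighted by alpha raised to the number of levels below it) is purely an assumption for illustration, as are the function names and the `parent_of` mapping; it should not be read as the paper's definition.

```python
from typing import Dict, List, Sequence


def prefix_hits(pred: Sequence[str], gold: Sequence[str]) -> List[int]:
    """hits[d] is 1 if pred matches gold on every level up to and including d."""
    hits = []
    ok = True
    for d in range(len(gold)):
        # A prefix stays correct only while every earlier level was correct too.
        ok = ok and d < len(pred) and pred[d] == gold[d]
        hits.append(1 if ok else 0)
    return hits


def depth_weighted_prefix_accuracy(
    pred: Sequence[str], gold: Sequence[str], alpha: float = 0.95
) -> float:
    """Weighted average of prefix hits over the levels of the gold path.

    ASSUMPTION (not from the source): level d (0-based) gets weight
    alpha ** (depth - 1 - d), so the deepest level has weight 1 and
    shallower levels are discounted when alpha < 1.
    """
    depth = len(gold)
    weights = [alpha ** (depth - 1 - d) for d in range(depth)]
    hits = prefix_hits(pred, gold)
    return sum(w * h for w, h in zip(weights, hits)) / sum(weights)


def is_valid_path(pred: Sequence[str], parent_of: Dict[str, str]) -> bool:
    """True if every consecutive (parent, child) pair is an edge in the taxonomy.

    `parent_of` is a hypothetical child -> parent mapping for the taxonomy.
    """
    return all(parent_of.get(child) == parent for parent, child in zip(pred, pred[1:]))
```

With alpha set to 1.0 every level is weighted equally, so a three-level path that is correct on its first two levels scores 2/3 under this sketch. Note that the validity check is independent of correctness: a path can be a valid chain in the taxonomy and still be wrong, or match the gold label at one level while pairing it with an invalid parent.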