Zero-Shot Compositional Action Recognition (ZS-CAR) requires recognizing novel verb-object combinations composed of previously observed primitives. In this work, we tackle a key failure mode: models predict verbs via object-driven shortcuts (i.e., relying on the labeled object class) rather than temporal evidence. We argue that two factors can promote object-driven shortcut learning: sparse compositional supervision and an asymmetry between verb learning and object learning. Our analysis, based on the diagnostic metrics we propose, shows that existing methods overfit to training co-occurrence patterns and underuse temporal verb cues, resulting in weak generalization to unseen compositions. To address object-driven shortcuts, we propose Robust COmpositional REpresentations (RCORE), which has two components. Co-occurrence Prior Regularization (CPR) adds explicit supervision for unseen compositions and regularizes the model against frequent co-occurrence priors by treating them as hard negatives. Temporal Order Regularization for Composition (TORC) enforces temporal-order sensitivity so that the model learns temporally grounded verb representations. On both Sth-com and EK100-com, RCORE reduces shortcut reliance as measured by these diagnostics and consequently improves compositional generalization.
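To make the notion of an object-driven shortcut concrete, the following is a minimal sketch of one way such reliance could be measured. It is an illustrative assumption, not the diagnostic metric proposed in this work, which the abstract does not define. The function names `cooccurrence_prior` and `shortcut_rate` are hypothetical: the first records the most frequent training verb for each object, and the second reports the fraction of wrong verb predictions that coincide with that prior.

```python
from collections import Counter, defaultdict


def cooccurrence_prior(train_pairs):
    """Map each object to its most frequent verb in the training pairs.

    train_pairs: iterable of (verb, obj) tuples.
    Hypothetical helper, not taken from the paper.
    """
    counts = defaultdict(Counter)
    for verb, obj in train_pairs:
        counts[obj][verb] += 1
    return {obj: c.most_common(1)[0][0] for obj, c in counts.items()}


def shortcut_rate(predictions, prior):
    """Fraction of wrong verb predictions that equal the object's prior verb.

    predictions: iterable of (true_verb, predicted_verb, obj) tuples.
    Returns 0.0 when there are no wrong predictions.
    """
    wrong = [(pred, obj) for true, pred, obj in predictions if pred != true]
    if not wrong:
        return 0.0
    hits = sum(1 for pred, obj in wrong if prior.get(obj) == pred)
    return hits / len(wrong)
```

Under this assumed definition, a high rate on unseen compositions would suggest that verb errors follow training co-occurrence statistics rather than temporal evidence in the video.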