Universal Prototype Transport for Zero-Shot Action Recognition and Localization

This work addresses the problem of recognizing action categories in videos when no training examples are available. The current state of the art enables such zero-shot recognition by learning universal mappings from videos to a semantic space, trained either on large-scale seen actions or on objects. While effective, we find that universal action and object mappings are biased toward specific regions of the semantic space. These biases lead to a fundamental problem: many unseen action categories are simply never inferred during testing. For example, on UCF-101, a quarter of the unseen actions are out of reach with a state-of-the-art universal action model. To address this, this paper introduces universal prototype transport for zero-shot action recognition. The main idea is to re-position the semantic prototypes of unseen actions by matching them to the distribution of all test videos. For universal action models, we propose to match distributions through a hyperspherical optimal transport from unseen action prototypes to the set of all projected test videos. The resulting transport couplings in turn determine the target prototype for each unseen action. Rather than directly using the target prototype as the final result, we re-position unseen action prototypes along the geodesic spanned by the original and target prototypes as a form of semantic regularization. For universal object models, we outline a variant that defines target prototypes based on an optimal transport between unseen action prototypes and object prototypes. Empirically, we show that universal prototype transport diminishes the biased selection of unseen action prototypes and boosts both universal action and object models for zero-shot classification and spatio-temporal localization.
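The re-positioning step for universal action models can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify the solver, the cost, or how couplings become target prototypes, so the sketch assumes entropic (Sinkhorn) optimal transport with uniform marginals, a cosine-distance cost on the unit hypersphere, and a target prototype taken as the normalized coupling-weighted mean of the test video embeddings. The names `sinkhorn`, `slerp`, `transport_prototypes`, `reg`, and `lam` are hypothetical.

```python
import numpy as np


def normalize(x, eps=1e-12):
    """Project rows of x onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def sinkhorn(cost, a, b, reg=0.05, n_iters=200):
    """Entropic optimal transport (assumed solver); returns the coupling matrix."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]


def slerp(p, q, lam):
    """Move each unit vector in p a fraction lam along the geodesic towards q."""
    cos = np.clip(np.sum(p * q, axis=-1, keepdims=True), -1.0, 1.0)
    omega = np.arccos(cos)
    sin = np.sin(omega)
    # Nearly identical (or antipodal) points: fall back to linear interpolation.
    small = sin < 1e-6
    safe_sin = np.where(small, 1.0, sin)
    w_p = np.where(small, 1.0 - lam, np.sin((1.0 - lam) * omega) / safe_sin)
    w_q = np.where(small, lam, np.sin(lam * omega) / safe_sin)
    return normalize(w_p * p + w_q * q)


def transport_prototypes(prototypes, videos, lam=0.5, reg=0.05):
    """Re-position unseen-action prototypes towards the test video distribution."""
    P = normalize(prototypes)          # (C, d) unseen action prototypes
    V = normalize(videos)              # (N, d) projected test videos
    cost = 1.0 - P @ V.T               # cosine distance (assumed cost)
    a = np.full(len(P), 1.0 / len(P))  # uniform mass on prototypes (assumption)
    b = np.full(len(V), 1.0 / len(V))  # uniform mass on videos (assumption)
    T = sinkhorn(cost, a, b, reg=reg)
    targets = normalize(T @ V)         # coupling-weighted mean, back on the sphere
    return slerp(P, targets, lam)      # lam=0 keeps originals, lam=1 gives targets


rng = np.random.default_rng(0)
prototypes = rng.normal(size=(5, 8))   # toy stand-ins for word embeddings
videos = rng.normal(size=(40, 8))      # toy stand-ins for projected test videos
moved = transport_prototypes(prototypes, videos, lam=0.5)
```

In this sketch `lam` plays the role of the semantic regularization: small values keep the prototypes close to their original word embeddings, while values near 1 trust the test video distribution. Test videos would then be classified by nearest re-positioned prototype under cosine similarity.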
