We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model with predicting hand positions, object positions, and the semantic labels of the objects, using paired captions when available. At inference time the model requires only RGB frames as input, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the quality of the object-aware representations learnt by our model by: (i) evaluating them for strong transfer, i.e. through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) using them as input for long-term video understanding tasks (e.g. Episodic Memory in Ego4D). In all cases the performance improves over the state of the art, even compared to networks trained with far larger batch sizes. We also show that, by using noisy image-level detections as pseudo-labels in training, the model learns to provide better bounding boxes by exploiting video consistency, and to ground the words in the associated text descriptions. Overall, we show that the model can act as a drop-in replacement for an ego-centric video model, improving performance through visual-text grounding.
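As a rough illustration of the training idea described in the abstract (a representation objective supplemented by hand- and object-position prediction against noisy pseudo-labels), the following is a minimal sketch and not the authors' implementation. The symmetric contrastive video-text loss, the L1 box loss, the function names, and the `aux_weight` parameter are all assumptions made for illustration; the abstract does not specify the loss functions, and the semantic-label prediction task is omitted here.

```python
import math

def l1_box_loss(pred_boxes, target_boxes):
    """Mean absolute error over box coordinates (x1, y1, x2, y2).

    Assumed stand-in for a position-prediction loss; targets would be
    noisy image-level detections used as pseudo-labels.
    """
    total, count = 0.0, 0
    for pred, target in zip(pred_boxes, target_boxes):
        for p, t in zip(pred, target):
            total += abs(p - t)
            count += 1
    return total / count if count else 0.0

def contrastive_loss(sim, temperature=0.05):
    """Symmetric InfoNCE over a square video-text similarity matrix.

    Assumption: matched video-caption pairs lie on the diagonal.
    """
    n = len(sim)

    def one_direction(rows):
        loss = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            loss += log_z - logits[i]
        return loss / n

    cols = [list(c) for c in zip(*sim)]  # transpose: text-to-video direction
    return 0.5 * (one_direction(sim) + one_direction(cols))

def training_loss(sim, pred_hand, tgt_hand, pred_obj, tgt_obj, aux_weight=1.0):
    """Hypothetical combined objective: contrastive term plus weighted
    hand-box and object-box auxiliary terms."""
    aux = l1_box_loss(pred_hand, tgt_hand) + l1_box_loss(pred_obj, tgt_obj)
    return contrastive_loss(sim) + aux_weight * aux
```

In this sketch the box terms matter only during training; at inference a model trained this way would consume RGB frames alone, which matches the setting described in the abstract.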