In this report, we present our solutions to the EgoVis Challenges at CVPR 2024, covering five tracks of the Ego4D challenge and three tracks of the EPIC-Kitchens challenge. Building upon a video-language two-tower model and our carefully curated egocentric video data, we introduce a novel foundation model called EgoVideo. This model is specifically designed for the unique characteristics of egocentric videos and underpins all of our competition submissions. In the Ego4D challenges, we tackle Natural Language Queries, Step Grounding, Moment Queries, Short-term Object Interaction Anticipation, and Long-term Action Anticipation. In the EPIC-Kitchens challenge, we compete in the Action Recognition, Multiple Instance Retrieval, and Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these diverse tasks, we showcase its versatility and effectiveness across egocentric video analysis scenarios, demonstrating its powerful representation ability as an egocentric foundation model. Our codebase and pretrained models are publicly available at https://github.com/OpenGVLab/EgoVideo.
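For readers unfamiliar with the two-tower formulation, the following sketch illustrates the general idea: one tower embeds video clips and the other embeds text, and the two embeddings are matched by similarity in a shared space. This is a minimal, hypothetical illustration; the module names, feature dimensions, and pooling choices below are assumptions for exposition and do not reflect EgoVideo's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerModel(nn.Module):
    """Illustrative video-language two-tower model (not EgoVideo itself)."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        # Placeholder video tower: mean-pool precomputed frame features,
        # then project into the shared embedding space.
        self.video_proj = nn.Linear(768, embed_dim)
        # Placeholder text tower: mean-pool token features, then project.
        self.text_proj = nn.Linear(512, embed_dim)

    def forward(self, frame_feats: torch.Tensor, token_feats: torch.Tensor):
        # frame_feats: (batch, num_frames, 768)
        # token_feats: (batch, num_tokens, 512)
        v = F.normalize(self.video_proj(frame_feats.mean(dim=1)), dim=-1)
        t = F.normalize(self.text_proj(token_feats.mean(dim=1)), dim=-1)
        # Similarity matrix: entry (i, j) scores video i against text j,
        # which supports retrieval-style tasks such as language queries.
        return v @ t.t()

model = TwoTowerModel()
sim = model(torch.randn(4, 16, 768), torch.randn(4, 32, 512))
print(sim.shape)  # torch.Size([4, 4])
```

A shared embedding space of this kind is what makes one backbone adaptable to retrieval, grounding, and recognition tracks alike, since each task reduces to scoring video segments against task-specific text.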