In this report, we present our solutions to the EgoVis Challenges at CVPR 2024, covering five tracks of the Ego4D challenge and three tracks of the EPIC-Kitchens challenge. Building upon a video-language two-tower model and leveraging our meticulously curated egocentric video data, we introduce a novel foundation model called EgoVideo. This model is specifically designed for the unique characteristics of egocentric videos and provides strong support for our competition submissions. In the Ego4D challenges, we tackle the Natural Language Queries, Step Grounding, Moment Queries, Short-term Object Interaction Anticipation, and Long-term Action Anticipation tasks. We also participate in the EPIC-Kitchens challenge, entering the Action Recognition, Multiple Instance Retrieval, and Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these diverse tasks, we showcase its versatility and effectiveness across egocentric video analysis scenarios, demonstrating its powerful representation ability as an egocentric foundation model. Our codebase and pretrained models are publicly available at https://github.com/OpenGVLab/EgoVideo.