Robotics, autonomous driving, augmented reality, and many other embodied computer vision applications must react quickly to user-defined events unfolding in real time. We address this setting by proposing a novel task for multimodal video understanding: Streaming Detection of Queried Event Start (SDQES). The goal of SDQES is to identify the beginning of a complex event, as described by a natural language query, with high accuracy and low latency. We introduce a new benchmark based on the Ego4D dataset, along with new task-specific metrics for studying streaming multimodal detection of diverse events in an egocentric video setting. Inspired by parameter-efficient fine-tuning methods from NLP and video tasks, we propose adapter-based baselines that enable image-to-video transfer learning, allowing for efficient online video modeling. We evaluate three vision-language backbones and three adapter architectures in both short-clip and untrimmed video settings.