Dense video captioning is the task of detecting and describing events within video sequences. While traditional approaches focus on offline solutions, where the entire video under analysis is available to the captioning model, in this work we introduce a paradigm shift towards Live Video Captioning (LVC). In LVC, dense video captioning models must generate captions for video streams in an online manner, facing important constraints such as having to work with partial observations of the video, the need for temporal anticipation and, of course, ideally ensuring a real-time response. In this work we formally introduce the novel problem of LVC and propose new evaluation metrics tailored to the online scenario, demonstrating their superiority over traditional metrics. We also propose an LVC model integrating deformable transformers and temporal filtering to address the new challenges of LVC. Experimental evaluations on the ActivityNet Captions dataset validate the effectiveness of our approach, highlighting its performance in LVC compared to state-of-the-art offline methods. The results of our model, as well as an evaluation kit with the novel metrics integrated, are made publicly available to encourage further research on LVC.