Generating automatic dense captions that accurately describe the contents of a video remains a challenging research problem. Most current models require processing the entire video at once. Instead, we propose an efficient, online approach that outputs frequent, detailed, and temporally aligned captions without access to future frames. Our model uses a novel autoregressive factorized decoding architecture, which models the sequence of visual features for each time segment, outputs localized descriptions, and efficiently leverages context from previous video segments. This allows the model to produce frequent, detailed captions that describe the video more comprehensively, according to its actual local content, rather than mimicking the training data. In addition, we propose an optimization for efficient training and inference, which enables scaling to longer videos. Our approach shows excellent performance compared to both offline and online methods, while using 20\% less compute. The annotations it produces are much more comprehensive and frequent, and can further be utilized for automatic video tagging and large-scale video data harvesting.
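To make the online, segment-wise decoding concrete, the following is a minimal sketch of the general idea: each video segment is summarized, folded into a running context that carries information from previous segments, and a caption is decoded autoregressively for that segment without looking at future frames. This is only an illustrative approximation under assumed module choices (GRU encoders, greedy decoding, dummy dimensions), not the authors' actual architecture.

import torch
import torch.nn as nn

class SegmentCaptioner(nn.Module):
    # Hypothetical module names and sizes; for illustration only.
    def __init__(self, feat_dim=256, hidden_dim=256, vocab_size=1000, max_len=16):
        super().__init__()
        self.visual_enc = nn.GRU(feat_dim, hidden_dim, batch_first=True)  # encodes one segment's frame features
        self.ctx_update = nn.GRUCell(hidden_dim, hidden_dim)              # carries context across segments
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.GRUCell(hidden_dim, hidden_dim)                 # autoregressive text decoder
        self.out = nn.Linear(hidden_dim, vocab_size)
        self.max_len = max_len

    @torch.no_grad()
    def caption_stream(self, segments):
        """Greedy-decode one caption per segment, never accessing future frames."""
        ctx = torch.zeros(1, self.decoder.hidden_size)
        captions = []
        for seg in segments:                      # seg: (1, frames_per_segment, feat_dim)
            _, h = self.visual_enc(seg)           # summarize the current segment
            ctx = self.ctx_update(h[-1], ctx)     # fold it into the running context
            state = ctx
            tok = torch.zeros(1, dtype=torch.long)
            toks = []
            for _ in range(self.max_len):         # autoregressive generation for this segment
                state = self.decoder(self.embed(tok), state)
                tok = self.out(state).argmax(-1)
                toks.append(int(tok))
            captions.append(toks)
        return captions

model = SegmentCaptioner()
video = [torch.randn(1, 8, 256) for _ in range(4)]  # 4 segments of 8 frame features each
print(model.caption_stream(video))

The key property illustrated is that the per-segment cost is constant and the only state kept between segments is the running context vector, which is what allows frequent, localized captions to be emitted online as the video streams in.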