In this work, we develop a prompting approach for incremental summarization of task videos. As an intermediate step, we propose a sample-efficient few-shot method for extracting semantic concepts. We take an existing model that extracts concepts from images and extend it to videos, and we introduce a clustering and querying approach for sample efficiency, motivated by recent advances in perceiver-based architectures. Our results provide further evidence that enriching the input context with relevant entities and actions from the videos, and using these as prompts, could improve the summaries generated by the model. We report results on a relevant dataset and discuss possible directions for future work.