Video captioning aims to describe the dynamic scenes in videos using natural language, which helps in understanding the spatiotemporal information in our environment. Despite recent advances, generating detailed and enriched video descriptions remains a substantial challenge. In this work, we introduce Video ChatCaptioner, an approach for creating more comprehensive spatiotemporal video descriptions. Our method employs a ChatGPT model as a controller that selects frames and poses content-driven questions about them. A robust algorithm then answers these visual queries. This question-answer framework uncovers intricate video details and shows promise as a method for enhancing video content. After multiple conversational rounds, ChatGPT summarizes the enriched video content based on the preceding conversation. We qualitatively demonstrate that Video ChatCaptioner generates captions containing more visual details about the videos. The code is publicly available at https://github.com/Vision-CAIR/ChatCaptioner
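The question-answer loop described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the released implementation: the names `chat_caption`, `stub_ask`, `stub_answer`, and `stub_summarize` are hypothetical, the controller and the answering algorithm are replaced by trivial stand-in functions, and frames are represented as plain strings rather than images.

```python
from typing import Callable, List, Sequence, Tuple

# One conversational turn: (frame index, question, answer).
Turn = Tuple[int, str, str]


def chat_caption(
    frames: Sequence[str],
    ask: Callable[[List[Turn], int], Tuple[int, str]],
    answer: Callable[[str, str], str],
    summarize: Callable[[List[Turn]], str],
    rounds: int,
) -> Tuple[str, List[Turn]]:
    """Run a fixed number of question-answer rounds, then summarize.

    `ask` plays the controller role (pick a frame, pose a question),
    `answer` plays the visual question answering role, and `summarize`
    turns the conversation history into a caption.
    """
    history: List[Turn] = []
    for _ in range(rounds):
        frame_idx, question = ask(history, len(frames))
        if not 0 <= frame_idx < len(frames):
            raise IndexError(f"controller chose invalid frame {frame_idx}")
        reply = answer(frames[frame_idx], question)
        history.append((frame_idx, question, reply))
    return summarize(history), history


# --- Hypothetical stand-ins for the real models (assumptions) ---

def stub_ask(history: List[Turn], num_frames: int) -> Tuple[int, str]:
    # Visit frames in order; a real controller would choose based on history.
    idx = len(history) % num_frames
    return idx, f"What is happening in frame {idx}?"


def stub_answer(frame: str, question: str) -> str:
    # Frames are plain text here, so the "answer" is the frame content itself.
    return frame


def stub_summarize(history: List[Turn]) -> str:
    # Join the answers in order into a single sentence.
    return ", then ".join(reply for _, _, reply in history) + "."


frames = ["a dog runs", "the dog jumps", "the dog catches a frisbee"]
caption, history = chat_caption(
    frames, stub_ask, stub_answer, stub_summarize, rounds=3
)
print(caption)
```

Passing the three roles as callables keeps the loop independent of any particular model: in a real system, `ask` and `summarize` would wrap calls to a chat model and `answer` would wrap a visual question answering model.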