Generating captions for long and complex videos is both critical and challenging, with significant implications for the growing fields of text-to-video generation and multi-modal understanding. One key challenge in long video captioning is accurately recognizing the same individuals when they appear in different frames, which we refer to as the ID-Matching problem. Few prior works have focused on this important issue. Those that have usually suffer from limited generalization and depend on point-wise matching, which limits their overall effectiveness. In this paper, unlike previous approaches, we build upon large vision-language models (LVLMs) to leverage their powerful priors. We aim to unlock the inherent ID-Matching capabilities of LVLMs themselves to improve the ID-Matching performance of captions. Specifically, we first introduce a new benchmark for assessing the ID-Matching capabilities of video captions. Using this benchmark, we investigate LVLMs, including GPT-4o, and find that ID-Matching performance can be improved in two ways: 1) enhancing the usage of image information and 2) increasing the amount of information in the descriptions of individuals. Based on these insights, we propose a novel video captioning method called Recognizing Identities for Captioning Effectively (RICE). Extensive experiments, including assessments of caption quality and ID-Matching performance, demonstrate the superiority of our approach. Notably, when implemented on GPT-4o, RICE improves the precision of ID-Matching from 50% to 90% and the recall of ID-Matching from 15% to 80% compared to the baseline. RICE makes it possible to continuously track different individuals in the captions of long videos.
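To make the precision and recall figures above concrete, the following is a minimal sketch of one plausible way to score ID-Matching. It assumes a pairwise definition (two mentions of a person count as matched when a caption assigns them the same identity), which is an illustrative assumption and not necessarily the definition used by the benchmark described here. The function names and the toy labels are hypothetical.

```python
from itertools import combinations


def same_id_pairs(labels):
    """Return the set of mention-index pairs that share an identity label."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(labels), 2)
        if a == b
    }


def id_matching_precision_recall(predicted, reference):
    """Pairwise precision and recall of predicted identity labels.

    Assumption: `predicted` and `reference` are equal-length lists with one
    identity label per person mention, in the same mention order. Label
    names need not agree between the two lists; only the grouping matters.
    """
    pred_pairs = same_id_pairs(predicted)
    ref_pairs = same_id_pairs(reference)
    correct = len(pred_pairs & ref_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(ref_pairs) if ref_pairs else 0.0
    return precision, recall


# Toy example: the caption links mentions 0 and 2 but treats mention 3
# as a new person, so one true identity is split in two.
reference = ["A", "B", "A", "A", "B"]
predicted = ["p1", "p2", "p1", "p3", "p2"]
print(id_matching_precision_recall(predicted, reference))
```

Under this definition, a caption that rarely links mentions across frames can keep precision high while recall stays low, which is the kind of gap the two reported numbers separate.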