Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to multimodal conversations. However, existing methods struggle to handle image and video understanding jointly, particularly when the number of visual tokens is limited. In this work, we introduce Chat-UniVi, a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos through a unified visual representation. Specifically, we employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework enables the model to use a limited number of visual tokens efficiently, capturing both the spatial details necessary for images and the comprehensive temporal relationships required for videos. Moreover, we leverage a multi-scale representation, enabling the model to perceive both high-level semantic concepts and low-level visual details. Notably, Chat-UniVi is trained on a mixed dataset containing both images and videos, allowing direct application to tasks involving both media without any modification. Extensive experimental results demonstrate that Chat-UniVi consistently outperforms even existing methods designed exclusively for either images or videos. Code is available at https://github.com/PKU-YuanGroup/Chat-UniVi.
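To make the idea of dynamic, multi-scale visual tokens concrete, the following is a minimal sketch only. The abstract does not specify how tokens are merged, so the sketch assumes plain k-means with evenly spaced initial centers as a stand-in; the function names `merge_tokens` and `multi_scale_tokens` and the `scales` parameter are hypothetical and are not taken from the Chat-UniVi codebase.

```python
import numpy as np


def merge_tokens(tokens, num_clusters, iters=10):
    """Merge (N, D) visual tokens into (num_clusters, D) by averaging cluster members.

    Assumption: plain k-means is used here purely for illustration.
    """
    tokens = np.asarray(tokens, dtype=float)
    n = tokens.shape[0]
    if not 1 <= num_clusters <= n:
        raise ValueError("num_clusters must be between 1 and the number of tokens")
    # Deterministic init: evenly spaced tokens serve as the starting centers.
    idx = np.linspace(0, n - 1, num_clusters).astype(int)
    centers = tokens[idx].copy()
    for _ in range(iters):
        # Squared distance from every token to every center, shape (N, K).
        dists = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = tokens[assign == k]
            if len(members) > 0:  # keep the previous center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return centers


def multi_scale_tokens(tokens, scales=(64, 32, 16)):
    """Merge tokens progressively and concatenate the output of every scale.

    Earlier (larger) scales keep finer detail; later (smaller) scales are coarser.
    """
    outputs = []
    current = np.asarray(tokens, dtype=float)
    for k in scales:
        current = merge_tokens(current, k)
        outputs.append(current)
    return np.concatenate(outputs, axis=0)


# Usage: 256 hypothetical patch tokens of dimension 8 from one image.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(256, 8))
merged = multi_scale_tokens(image_tokens)
print(merged.shape)  # (112, 8): 64 + 32 + 16 tokens
```

Under these assumptions, the same routine could be applied to video by merging the tokens of each frame (or group of frames) in the same way, so that images and videos share one token format with a bounded token budget.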