In this work, we present LLaMA-VID, a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding. Current VLMs, while proficient in tasks like image captioning and visual question answering, face heavy computational burdens when processing long videos because of the excessive number of visual tokens. LLaMA-VID addresses this issue by representing each frame with two distinct tokens: a context token and a content token. The context token encodes the overall image context based on user input, whereas the content token encapsulates the visual cues in each frame. This dual-token strategy significantly reduces the computational overload of long videos while preserving critical information. In general, LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. It surpasses previous methods on most video- and image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID
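To make the dual-token idea concrete, the following is a minimal sketch and not the released implementation. It assumes each frame is already encoded as a list of patch feature vectors and that the user input is summarized as a single query vector. The function names (`content_token`, `context_token`, `encode_video`), the mean pooling, and the softmax-weighted pooling are all illustrative assumptions; the abstract only states that each frame yields one context token conditioned on user input and one content token.

```python
import math

def content_token(patches):
    """Mean-pool the patch features of one frame into a single vector.
    Mean pooling is an assumed choice for this sketch."""
    n = len(patches)
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / n for i in range(dim)]

def context_token(patches, query):
    """Pool patch features with softmax weights given by their dot-product
    similarity to a query vector derived from the user input (assumed form)."""
    scores = [sum(a * b for a, b in zip(p, query)) for p in patches]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patches[0])
    return [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]

def encode_video(frames, query):
    """Return two tokens per frame: context token first, then content token."""
    tokens = []
    for patches in frames:
        tokens.append(context_token(patches, query))
        tokens.append(content_token(patches))
    return tokens

# Toy input: 3 frames, each with 4 patches of dimension 2 (hypothetical values).
frames = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]] for _ in range(3)]
query = [1.0, 0.0]
tokens = encode_video(frames, query)
print(len(tokens))  # 2 tokens per frame, so 6 tokens for 3 frames
```

Under these assumptions the token count grows as twice the number of frames, regardless of how many patches each frame contains. That is why the per-frame cost stays fixed as videos get longer.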