The recent progress in Large Language Models (LLMs) has spurred advances in image-language conversation agents, while how to build a proficient video-based dialogue system is still under exploration. Given the extensive scale of the LLM and the visual backbone, little GPU memory is left for effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose the Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-play temporal modeling branch alongside the pretrained visual encoder, and is tuned while the backbone remains frozen. Pretrained just once, BT-Adapter can be seamlessly integrated into any image conversation model built on the same version of CLIP, enabling video conversations without the need for video instructions. Besides, we develop a unique asymmetric token masking strategy inside the branch, together with tailor-made training tasks for BT-Adapter, facilitating faster convergence and better results. Thanks to BT-Adapter, we can equip existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks while using thousands fewer GPU hours; (2) better performance than current video chatbots without any video instruction tuning; and (3) state-of-the-art results on video chatting with video instruction tuning, outperforming previous SOTAs by a large margin.
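The core idea of a branching adapter can be illustrated with a minimal sketch: per-frame features come from a frozen image encoder, while a small parallel branch mixes information across frames and adds it back residually, so only the branch's weights would receive gradients. The code below is a toy illustration under these assumptions (NumPy stand-ins for the encoder and branch; `frozen_frame_encoder`, `TemporalBranch`, and all dimensions are hypothetical, not the paper's actual architecture).

```python
import numpy as np

rng = np.random.default_rng(0)

D_in, D = 48, 8
W_frozen = rng.normal(size=(D_in, D))  # fixed weights: stand-in for a frozen CLIP-style encoder


def frozen_frame_encoder(frames):
    """Encode each frame independently (no temporal mixing), weights frozen.
    frames: (T, H, W, C) -> per-frame features (T, D)."""
    T = frames.shape[0]
    return frames.reshape(T, -1) @ W_frozen


class TemporalBranch:
    """Hypothetical trainable branch alongside the frozen encoder.
    Only self.W would be updated during tuning; the backbone stays frozen."""

    def __init__(self, dim):
        self.W = rng.normal(scale=0.01, size=(dim, dim))  # trainable weights

    def __call__(self, feats):
        # Uniform temporal pooling gives each frame a view of the whole clip;
        # the learned projection is added residually, so the frozen per-frame
        # features pass through unchanged when W is zero.
        pooled = feats.mean(axis=0, keepdims=True)  # (1, D) cross-frame mix
        return feats + pooled @ self.W              # (T, D)


frames = rng.normal(size=(4, 4, 4, 3))      # 4 frames of a toy "video"
frame_feats = frozen_frame_encoder(frames)  # frozen spatial path: (4, 8)
branch = TemporalBranch(D)
video_feats = branch(frame_feats)           # temporally-aware features: (4, 8)
print(video_feats.shape)
```

The residual form is the key design choice: at initialization (or with the branch zeroed out), the model reduces exactly to the pretrained image encoder, which is what lets the adapter be bolted onto an already-trained image conversation model without disturbing it.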