The performance of Large Vision Language Models (LVLMs) depends on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity: they are derived by prompting large language models with video captions to generate question-answer pairs, and are therefore mostly descriptive. Meanwhile, many labeled video datasets with diverse labels and supervision exist; however, we find that their integration into LVLMs is non-trivial. Herein, we present Video Self-Training with augmented Reasoning (Video-STaR), the first video self-training approach. Video-STaR allows the utilization of any labeled video dataset for video instruction tuning. In Video-STaR, an LVLM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LVLMs to novel downstream tasks with existing supervision. During generation, an LVLM is prompted to propose an answer; the answers are then filtered to retain only those that contain the original video labels, and the LVLM is re-trained on the resulting dataset. By training only on generated answers that contain the correct video labels, Video-STaR utilizes these existing video labels as weak supervision for video instruction tuning. Our results demonstrate that Video-STaR-enhanced LVLMs exhibit improved performance on (I) general video QA, where TempCompass performance improved by 10%, and (II) downstream tasks, where Video-STaR improved Kinetics700-QA accuracy by 20% and action quality assessment on FineDiving by 15%.
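The generation-and-filtering cycle described above can be illustrated with a minimal sketch. The interfaces here (`generate_answer`, the per-video `question`/`labels` fields) are hypothetical stand-ins for the paper's actual pipeline; the sketch only shows the core idea of keeping a generated answer when it contains the ground-truth video labels, then re-training on the survivors.

```python
def label_filter(answer, labels):
    """Keep a generated answer only if it mentions every ground-truth label.

    Case-insensitive substring matching is an assumption; the paper may use
    a more sophisticated verification step.
    """
    text = answer.lower()
    return all(label.lower() in text for label in labels)


def self_train_round(videos, generate_answer):
    """One Video-STaR generation round (sketch).

    `videos` is a list of dicts with hypothetical keys "id", "question",
    and "labels"; `generate_answer` stands in for the LVLM's proposal step.
    Returns the label-verified subset, which would then be used to
    re-train (finetune) the LVLM before the next round.
    """
    kept = []
    for video in videos:
        answer = generate_answer(video)  # LVLM proposes an answer
        if label_filter(answer, video["labels"]):
            kept.append({"video": video["id"], "answer": answer})
    return kept
```

Cycling this round with finetuning is what lets the existing labels act as weak supervision: incorrect generations never enter the training set, while correct ones teach the model to reason toward the labels.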