Recent advances in large language models (LLMs) have driven the development of video large multimodal models (VLMMs). Previous approaches to VLMMs rely on supervised fine-tuning (SFT) with instruction-tuned datasets, integrate an LLM with visual encoders, and add learnable modules. Aligning the video and text modalities remains challenging, primarily because multimodal instruction-tuning data is deficient in both volume and quality compared with text-only data. We present a novel alignment strategy, Reinforcement Learning from AI Feedback (RLAIF), in which a multimodal AI system oversees itself: it provides self-preference feedback to refine itself and thereby facilitates the alignment of video and text modalities. Specifically, we propose context-aware reward modeling, which supplies detailed video descriptions as context when generating preference feedback in order to enrich the understanding of video content. Our multimodal RLAIF approach, VLM-RLAIF, demonstrates enhanced performance across diverse video benchmarks and outperforms existing approaches, including the SFT model. We commit to open-sourcing our code, models, and datasets to foster further research in this area.
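To make the idea of context-aware preference feedback concrete, the following is a minimal sketch, not the released implementation. It assumes that preference labeling amounts to prompting a judge model with a question, two candidate responses, and optionally a detailed video description as context. The names `PreferencePair`, `build_judge_prompt`, `label_preference`, the prompt wording, and the `judge` callable are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PreferencePair:
    """One preference label: the response judged better and the one judged worse."""
    question: str
    chosen: str
    rejected: str


def build_judge_prompt(question: str, response_a: str, response_b: str,
                       video_description: Optional[str] = None) -> str:
    """Assemble a judging prompt; the wording here is an assumption.

    When a video description is given, it is placed first so the judge
    can ground its comparison in the described video content.
    """
    parts = []
    if video_description:
        parts.append(f"Video description (context):\n{video_description}")
    parts.append(f"Question: {question}")
    parts.append(f"Response A: {response_a}")
    parts.append(f"Response B: {response_b}")
    parts.append("Which response is better grounded in the video? Answer 'A' or 'B'.")
    return "\n\n".join(parts)


def label_preference(judge: Callable[[str], str], question: str,
                     response_a: str, response_b: str,
                     video_description: Optional[str] = None) -> PreferencePair:
    """Ask the judge (a stand-in for the model itself) which response it prefers."""
    prompt = build_judge_prompt(question, response_a, response_b, video_description)
    verdict = judge(prompt).strip().upper()
    if verdict.startswith("A"):
        return PreferencePair(question, chosen=response_a, rejected=response_b)
    if verdict.startswith("B"):
        return PreferencePair(question, chosen=response_b, rejected=response_a)
    raise ValueError(f"Unrecognized judge verdict: {verdict!r}")
```

In a self-feedback setting, `judge` would wrap a call to the multimodal model, and the resulting `PreferencePair` records would serve as training data for a reward model. Omitting `video_description` gives the context-free variant for comparison.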