Recent advances in large language models (LLMs) have driven the development of video large multimodal models (VLMMs). Previous approaches to VLMMs relied on supervised fine-tuning (SFT) with instruction-tuned datasets, integrating LLMs with visual encoders, and adding learnable modules. Aligning video and text modalities remains challenging, primarily because multimodal instruction-tuning data is limited in volume and quality compared with text-only data. We present a novel alignment strategy, Reinforcement Learning from AI Feedback (RLAIF), in which a multimodal AI system oversees itself: it provides self-preference feedback to refine itself and to facilitate the alignment of video and text modalities. Specifically, we propose context-aware reward modeling, which supplies detailed video descriptions as context during the generation of preference feedback to enrich the understanding of video content. Our multimodal RLAIF approach, VLM-RLAIF, demonstrates enhanced performance across diverse video benchmarks and outperforms existing approaches, including the SFT model. We commit to open-sourcing our code, models, and datasets to foster further research in this area.
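The context-aware preference step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the function names `build_judge_prompt` and `to_preference_pair`, and the `PreferencePair` record are all hypothetical, and the call to the model that produces the verdict is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One preference record, e.g. for training a reward model (hypothetical layout)."""
    question: str
    chosen: str
    rejected: str


def build_judge_prompt(question: str, response_a: str, response_b: str,
                       video_description: str) -> str:
    """Build a judging prompt that includes a detailed video description as context.

    The wording is an assumption made for illustration only.
    """
    return (
        "Context (detailed video description):\n"
        f"{video_description}\n\n"
        f"Question: {question}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Which response is better grounded in the video? Answer 'A' or 'B'."
    )


def to_preference_pair(question: str, response_a: str, response_b: str,
                       verdict: str) -> PreferencePair:
    """Turn a judge verdict ('A' or 'B') into a chosen/rejected pair."""
    normalized = verdict.strip().upper()
    if normalized not in {"A", "B"}:
        raise ValueError(f"Unexpected verdict: {verdict!r}")
    if normalized == "A":
        return PreferencePair(question, chosen=response_a, rejected=response_b)
    return PreferencePair(question, chosen=response_b, rejected=response_a)
```

In this sketch the same model that generated the two responses would be asked to answer the judging prompt, and its verdict would be passed to `to_preference_pair` to produce the self-preference data.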