The application of Large Vision-Language Models (LVLMs) to image and video analysis is an exciting and rapidly evolving field. In recent years, high-quality image-text datasets for fine-tuning image understanding have grown substantially, but comparable datasets for video remain scarce. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not handle the complexities of longer videos efficiently. In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to cover a wide range of questions. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed \model{} achieves state-of-the-art results across various video tasks and shows strong generalization, setting new baselines in multi-image understanding. Notably, \model{} delivers an absolute improvement of 2.7\% over LLaVA-OneVision on VideoMME and 10.7\% on MuirBench. Code is available at https://github.com/Hon-Wong/ByteVideoLLM
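To make the idea of dynamic visual token compression concrete, the sketch below shows one plausible way such a mechanism could work: pooling each frame's token grid more aggressively as the number of frames grows, so the total token count stays within a fixed budget. This is a minimal illustration under our own assumptions, not the paper's actual architecture; the function name, the `token_budget` parameter, and the pooling schedule are all hypothetical.

```python
import torch
import torch.nn.functional as F

def compress_visual_tokens(frame_tokens: torch.Tensor,
                           token_budget: int = 2048) -> torch.Tensor:
    """Hypothetical dynamic visual token compressor.

    frame_tokens: (T, H, W, D) per-frame grids of visual tokens from a
    vision encoder. Each frame's grid is adaptively average-pooled so the
    total number of tokens across all T frames stays within token_budget.
    """
    t, h, w, d = frame_tokens.shape
    per_frame = max(1, token_budget // t)               # token allowance per frame
    # Choose a square-ish output grid no larger than the input grid.
    side = max(1, min(h, w, int(per_frame ** 0.5)))
    x = frame_tokens.permute(0, 3, 1, 2)                # (T, D, H, W) for pooling
    x = F.adaptive_avg_pool2d(x, output_size=(side, side))
    # Flatten back to a single token sequence: (T * side * side, D).
    return x.flatten(2).transpose(1, 2).reshape(t * side * side, d)
```

Under these assumptions, a 2048-token budget would let 8 frames keep a 16x16 grid (256 tokens each), while 64 frames would be pooled down to a 5x5 grid (25 tokens each), so longer videos remain within roughly the same compute envelope at the cost of per-frame spatial detail.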