In this report, we introduce MammothModa, yet another multi-modal large language model (MLLM) designed to achieve state-of-the-art performance starting from an elementary baseline. We focus on three key design insights: (i) Integrating Visual Capabilities while Maintaining Complex Language Understanding: beyond the vision encoder, we incorporate Visual Attention Experts into the LLM to enhance its visual capabilities. (ii) Extending the Context Window for High-Resolution, Long-Duration Visual Features: we explore a Visual Merger Module to effectively reduce the token count of high-resolution images, and we introduce frame position IDs to avoid position interpolation. (iii) High-Quality Bilingual Datasets: we meticulously curate and filter a high-quality bilingual multimodal dataset to reduce visual hallucinations. With this recipe, MammothModa consistently outperforms state-of-the-art models, e.g., the LLaVA series, across the main real-world visual-language benchmarks without bells and whistles.
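To make insight (i) concrete, the sketch below shows one common way such a mechanism can be realized: visual tokens are routed through a dedicated set of attention projections while text tokens keep the language model's original projections, so visual capacity is added without overwriting the pretrained language pathway. This is a minimal illustration under our own assumptions; the module and argument names are hypothetical and the report does not specify the exact parameterization.

```python
# Minimal sketch of a "visual attention expert": separate QKV projections
# for visual vs. text tokens, joint attention over the full sequence.
# Names and shapes are illustrative assumptions, not the paper's API.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualExpertAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Original language-model projections (pretrained; could be frozen).
        self.qkv_text = nn.Linear(dim, 3 * dim)
        # Parallel "expert" projections trained only on visual tokens.
        self.qkv_visual = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); visual_mask: (batch, seq) bool, True at image tokens.
        qkv = torch.where(
            visual_mask.unsqueeze(-1),  # route each token to its expert
            self.qkv_visual(x),
            self.qkv_text(x),
        )
        b, s, _ = x.shape
        q, k, v = qkv.chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)  # text and image attend jointly
        return self.out(attn.transpose(1, 2).reshape(b, s, -1))
```

Because both token types still attend over the whole sequence, the expert adds visual-specific capacity while leaving the original text path intact.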
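For insight (ii), the following sketch illustrates, again under stated assumptions, the two mechanisms named in the abstract: a merger that pools a high-resolution grid of patch tokens to cut the visual token count, and shared per-frame position IDs so a long video consumes one position per frame rather than one per token. The 2x2 merge factor and all shapes are illustrative guesses, not values from the report.

```python
# Sketch of (a) a visual merger that fuses each 2x2 block of patch tokens
# into one token, and (b) frame-level position IDs for long videos.
# Merge factor and helper names are assumptions for illustration.
import torch
import torch.nn as nn

class VisualMerger(nn.Module):
    """Merge each 2x2 block of patch tokens into a single token."""
    def __init__(self, dim: int, merge: int = 2):
        super().__init__()
        self.merge = merge
        self.proj = nn.Linear(dim * merge * merge, dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h*w, dim) patch features from the vision encoder.
        b, _, d = tokens.shape
        m = self.merge
        grid = tokens.view(b, h // m, m, w // m, m, d)
        grid = grid.permute(0, 1, 3, 2, 4, 5)          # group each m x m block
        grid = grid.reshape(b, (h // m) * (w // m), m * m * d)
        return self.proj(grid)                          # 4x fewer visual tokens

def frame_position_ids(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    # All tokens of frame t share position ID t, so a 1000-frame video uses
    # 1000 positions instead of 1000 * tokens_per_frame, sidestepping the
    # need to interpolate position embeddings beyond the trained window.
    return torch.arange(num_frames).repeat_interleave(tokens_per_frame)
```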