Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. For data, we introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni by +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens, a sixfold reduction compared to Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.
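To make the temporal-encoding idea concrete, below is a minimal, hypothetical sketch of a RoPE-style rotation driven by absolute timestamps that are clipped to a fixed horizon, illustrating the general notion of a constrained rotary time embedding applied to tokens from different modalities sharing one clock. The function name `rotary_time_embedding`, the `max_time` horizon, and the toy frame/audio rates are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch (assumptions, not OmniVinci's released code): encode absolute
# time into omni-modal embeddings via RoPE-style rotations whose angles are
# derived from timestamps constrained to a bounded range.
import torch


def rotary_time_embedding(x: torch.Tensor,
                          timestamps: torch.Tensor,
                          max_time: float = 128.0) -> torch.Tensor:
    """Rotate embedding pairs by angles derived from absolute timestamps.

    x:          (num_tokens, dim) token embeddings; dim must be even.
    timestamps: (num_tokens,) absolute times in seconds for each token.
    max_time:   assumed constraint horizon; times are clipped so rotation
                angles stay bounded regardless of clip length.
    """
    dim = x.shape[-1]
    # Constrain absolute time to [0, max_time] so angles remain bounded.
    t = timestamps.clamp(0, max_time)
    # Standard RoPE-style inverse frequencies, one per embedding pair.
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=x.dtype) / dim))
    angles = t[:, None] * inv_freq[None, :]               # (num_tokens, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                   # split into pairs
    # Apply a 2-D rotation to each (x1, x2) pair.
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)


# Toy usage: vision tokens sampled at 1 fps and audio tokens at 0.5 s hops share
# the same clock, so tokens that are close in time receive similar rotations
# regardless of modality.
vision = torch.randn(8, 64)
audio = torch.randn(16, 64)
v_time = torch.arange(8, dtype=torch.float32)             # seconds
a_time = torch.arange(16, dtype=torch.float32) * 0.5
tokens = torch.cat([rotary_time_embedding(vision, v_time),
                    rotary_time_embedding(audio, a_time)])
```

Under these assumptions, timestamp-driven rotations give the LLM an absolute temporal signal shared across modalities, complementing the relative ordering captured by grouping vision and audio embeddings along the time axis.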