Information comes in diverse modalities. Multimodal native AI models are essential for integrating real-world information and delivering comprehensive understanding. While proprietary multimodal native models exist, their lack of openness imposes obstacles to adoption, let alone adaptation. To fill this gap, we introduce Aria, an open multimodal native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. Aria is a mixture-of-experts model with 3.9B and 3.5B activated parameters per visual token and text token, respectively. It outperforms Pixtral-12B and Llama3.2-11B, and is competitive with the best proprietary models on various multimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline, which progressively equips the model with strong capabilities in language understanding, multimodal understanding, long context windows, and instruction following. We open-source the model weights along with a codebase that facilitates easy adoption and adaptation of Aria in real-world applications.