We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address this issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, which exhibits proficiency in any-to-any modality conversation and surpasses previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.