In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model designed for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, on top of an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm: within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby ensuring efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 is the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, together with a comprehensive empirical analysis of elastic training, aiming to offer useful insights to the community.
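To make the routing idea above concrete, the following is a minimal, self-contained PyTorch sketch of an MoE layer with a single router shared across all modalities, whose top-k routing sparsity can be varied at inference time, loosely mirroring the modality-agnostic routing and elastic sparsity described in the abstract. This is an illustration under our own assumptions, not the ERNIE 5.0 implementation; all names (SharedRouterMoE, d_model, n_experts, top_k) are hypothetical.

```python
# Minimal sketch (not the ERNIE 5.0 implementation) of a mixture-of-experts layer
# with modality-agnostic routing: one shared router scores every expert for every
# token regardless of modality, and the top-k sparsity is a runtime argument.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedRouterMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        # A single router shared by all modalities (text, image, video, audio tokens).
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, top_k: int = 2) -> torch.Tensor:
        # x: (n_tokens, d_model); tokens from any modality follow the same routing path.
        logits = self.router(x)                          # (n_tokens, n_experts)
        weights, indices = torch.topk(logits, top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)             # renormalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(top_k):
            for e in range(len(self.experts)):
                mask = indices[:, slot] == e             # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out


if __name__ == "__main__":
    moe = SharedRouterMoE(d_model=64, d_ff=128, n_experts=8)
    tokens = torch.randn(16, 64)        # a mixed batch of multimodal token embeddings
    higher_cap = moe(tokens, top_k=4)   # higher-capacity routing
    sparser = moe(tokens, top_k=1)      # sparser, sub-model-style routing
    print(higher_cap.shape, sparser.shape)
```

Because the router is shared and the number of selected experts is a runtime parameter rather than a fixed architectural constant, the same weights can serve denser or sparser routing configurations, which is the flavor of trade-off the elastic training paradigm is described as exposing.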