We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts (MoE) model with 400B total parameters and 13B activated per token. We also report on two smaller models: Trinity Nano, with 6B total parameters and 1B activated per token, and Trinity Mini, with 26B total parameters and 3B activated per token. The models share a modern architecture that includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for the Mixture-of-Experts layers. For Trinity Large, we also introduce a new MoE load-balancing strategy, Soft-clamped Momentum Expert Bias Updates (SMEBU). All three models were trained with the Muon optimizer and completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at https://huggingface.co/arcee-ai.