Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet the resulting demand for MoE training, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks to support new architectures and system optimizations remains expensive. AI coding agents could automate parts of training-framework development and accelerate this evolution. But applying them to existing frameworks carries hidden costs that today's throughput-only evaluations do not capture. We name this missing dimension agent-task efficiency (ATE): the cost of using coding agents to understand, operate, and extend a framework. Grounded in four agent-native design principles, we build PithTrain, a compact, agent-native MoE training framework. We further introduce ATE-Bench, a benchmark covering real-world training-framework tasks. Our evaluation shows that PithTrain matches the throughput of production frameworks and, on ATE-Bench, achieves higher agent-task efficiency, with up to 62% fewer Agent Turns and 64% less Active GPU Time.