In-Network Collective (INC) acceleration holds immense potential for optimizing AI training and inference; however, its cross-layer nature has historically hindered investment and adoption within the open Ethernet ecosystem. To bridge this gap, we propose EPIC (Ethernet Polymorphic In-network Collective), an INC protocol specification and reference system built on the principle of "Unified Abstraction, Polymorphic Realization." EPIC introduces an abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles, while offering polymorphic realizations tailored to varying hardware capabilities. We address three fundamental challenges: first, we employ a modular design that enables an evolutionary path from simple to complex implementations, allowing vendors to iterate on their hardware incrementally; second, we apply formal verification to prove the correctness of all proposed polymorphic modes; and third, we develop a unified resource management model versatile enough for diverse INC scenarios. Extensive validation (spanning model checking, packet- and flow-level simulation, VM emulation, a Tofino testbed, and FPGA/RTL verification) confirms EPIC's correctness, performance gains, and feasibility.