Although existing frameworks for large language model (LLM) inference on CPUs are mature, they fail to fully exploit the computational potential of many-core CPU platforms. Many-core CPUs are widely deployed in web servers and high-end networking devices, and are typically organized into multiple NUMA nodes, each grouping a set of cores with its local memory. Current frameworks largely overlook the substantial overhead of cross-NUMA memory access, which limits both inference scalability and the ability to bring intelligent capabilities to such platforms. To address this limitation, we present ArcLight, a lightweight LLM inference architecture designed from the ground up for many-core CPUs. ArcLight integrates efficient memory management and thread scheduling, and introduces finely controlled tensor parallelism to mitigate the cross-node memory access wall. Experimental results show that ArcLight exceeds the performance ceiling of mainstream frameworks, achieving up to 46% higher inference throughput. ArcLight also remains compatible with arbitrary CPU devices. ArcLight is publicly available at https://github.com/OpenBMB/ArcLight.