The push for greater efficiency in AI computation has given rise to an array of accelerator architectures that increasingly challenge the GPU's long-standing dominance. In this work, we provide a quantitative view of this evolving landscape of AI accelerators, including the Cerebras CS-3, SambaNova SN-40, Groq, Gaudi, and TPUv5e platforms, and compare them against both NVIDIA (A100, H100) and AMD (MI-300X) GPUs. We evaluate key trade-offs in latency, throughput, power consumption, and energy efficiency across both (i) end-to-end workloads and (ii) benchmarks of individual computational primitives. Notably, we find that the optimal hardware platform varies with batch size, sequence length, and model size, revealing a large underlying optimization space. Our analysis includes detailed power measurements across the prefill and decode phases of LLM inference, as well as a quantification of the energy cost of communication. We additionally find that Cerebras, SambaNova, and Gaudi have 10-60% higher idle power than NVIDIA and AMD GPUs, which emphasizes the importance of high utilization for realizing the promised efficiency gains. Finally, we assess programmability across platforms based on our experiments with real profiled workloads, comparing the compilation times and software stack maturity required to achieve the promised performance.
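The link between idle power and utilization noted in the abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not the measurement methodology of this work: it assumes a device that alternates between a fully busy state and an idle state, and every number in it (power draws, throughputs, utilization levels) is hypothetical and chosen only to show the effect. The function name `energy_per_token` and the two devices are likewise illustrative.

```python
def energy_per_token(active_power_w, idle_power_w, utilization, tokens_per_s):
    """Average joules per token for a device that is busy a `utilization`
    fraction of the time and idle otherwise (simplified two-state model)."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Time-averaged power: busy fraction at active power, the rest at idle power.
    avg_power_w = utilization * active_power_w + (1 - utilization) * idle_power_w
    # Tokens are produced only while the device is busy.
    avg_tokens_per_s = utilization * tokens_per_s
    return avg_power_w / avg_tokens_per_s


# Hypothetical devices (all values are assumptions, not measurements):
# the "accelerator" is 20% faster when busy but has 60% higher idle power.
gpu = dict(active_power_w=500.0, idle_power_w=100.0, tokens_per_s=1000.0)
accel = dict(active_power_w=500.0, idle_power_w=160.0, tokens_per_s=1200.0)

for u in (1.0, 0.5, 0.1):
    e_gpu = energy_per_token(utilization=u, **gpu)
    e_accel = energy_per_token(utilization=u, **accel)
    print(f"utilization={u:.1f}  gpu={e_gpu:.3f} J/token  accel={e_accel:.3f} J/token")
```

Under these assumed numbers, the faster device is more energy-efficient per token at full utilization, but its higher idle power makes it the less efficient choice once utilization drops to 10%. Real deployments have intermediate power states and batch-dependent throughput, so this model only captures the direction of the effect.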