TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

Norman P. Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, David Patterson

From arXiv: 15 pages, 16 figures; to be published at ISCA 2023 (the International Symposium on Computer Architecture)

In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain-specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) dynamically reconfigure its interconnect topology to improve scale, availability, utilization, modularity, deployment, security, power, and performance; users can pick a twisted 3D torus topology if desired. Much cheaper, lower power, and faster than InfiniBand, OCSes and underlying optical components are <5% of system cost and <3% of system power. Each TPU v4 includes SparseCores, dataflow processors that accelerate models that rely on embeddings by 5x-7x yet use only 5% of die area and power. Deployed since 2020, TPU v4 outperforms TPU v3 by 2.1x and improves performance/Watt by 2.7x. The TPU v4 supercomputer is 4x larger at 4096 chips and thus ~10x faster overall, which along with OCS flexibility helps large language models. For similar-sized systems, it is ~4.3x-4.5x faster than the Graphcore IPU Bow and is 1.2x-1.7x faster and uses 1.3x-1.9x less power than the Nvidia A100. TPU v4s inside the energy-optimized warehouse scale computers of Google Cloud use ~3x less energy and produce ~20x less CO2e than contemporary DSAs in a typical on-premises data center.
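As a rough sanity check on the TPU v3 comparison, the figures quoted in the abstract can be combined with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the 2.1x and 2.7x figures refer to the same workloads at chip level. The abstract does not say how the ~10x system figure was derived, so the naive product (about 8.4x) should be read as an order-of-magnitude cross-check rather than a reproduction of the authors' calculation.

```python
# Figures quoted in the abstract (TPU v4 relative to TPU v3).
V4_CHIPS = 4096             # chips in the TPU v4 supercomputer
SCALE_VS_V3 = 4.0           # "4x larger"
PERF_VS_V3 = 2.1            # per-chip performance ratio
PERF_PER_WATT_VS_V3 = 2.7   # performance/Watt ratio


def implied_v3_chips(v4_chips: float, scale: float) -> float:
    """Chip count of the previous system implied by '4x larger at 4096 chips'."""
    return v4_chips / scale


def naive_system_speedup(scale: float, per_chip_speedup: float) -> float:
    """Assumption: system speedup is chip-count ratio times per-chip speedup."""
    return scale * per_chip_speedup


def implied_power_ratio(perf_ratio: float, perf_per_watt_ratio: float) -> float:
    """perf/Watt ratio = perf ratio / power ratio, so power ratio = perf / (perf/Watt)."""
    return perf_ratio / perf_per_watt_ratio


if __name__ == "__main__":
    print("implied TPU v3 system size:", implied_v3_chips(V4_CHIPS, SCALE_VS_V3))
    print("naive system speedup:", round(naive_system_speedup(SCALE_VS_V3, PERF_VS_V3), 2))
    print("implied power ratio:", round(implied_power_ratio(PERF_VS_V3, PERF_PER_WATT_VS_V3), 3))
```

Under these assumptions, the implied power ratio is below 1: delivering 2.1x the performance at 2.7x the performance/Watt corresponds to roughly 0.78x the power of the TPU v3 baseline.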
