The rapid growth in compute demand from artificial intelligence (AI) has driven a massive surge in data center construction, precipitating an energy and sustainability crisis. Motivated by the abundant solar energy in outer space and the recent sharp reduction in space launch costs, orbital data centers are emerging as a potential pathway for the future scaling of AI compute infrastructure. Although the cold background of space seems appealing for cooling, computing systems operating in vacuum have no convection and must ultimately rely on radiative cooling, which requires large-area radiators. This thermal management limitation poses a significant challenge for deploying standard liquid- or air-cooled computers in space. In this work, we investigate the impact of the thermal constraints in space on both graphics processing units (GPUs) with high-bandwidth memory (HBM) and emerging compute-in-memory (CIM) accelerators. We develop a radiator-in-the-loop co-design methodology that directly links the permitted system TOPS (tera-operations per second) to the practical radiator cooling capacity in space. Our thermal simulations reveal that the physically separated GPU die and HBM stacks create severe thermal hotspots under limited radiator capacity, necessitating GPU thermal throttling. In contrast, CIM accelerators exhibit a much more uniform heat distribution and consistently outperform GPUs in TOPS/W across a wide range of radiator budgets. We systematically evaluate the performance of CIM and GPU across various AI workloads and demonstrate that the advantage of CIM is magnified for deployment in space under realistic thermal constraints.
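To illustrate why radiative cooling implies large-area radiators, and how a radiator budget can cap the permitted TOPS, the following is a minimal sketch based only on the Stefan-Boltzmann law. It is not the radiator-in-the-loop methodology of this work. The function names, the emissivity of 0.9, the 3 K sink temperature, the single-sided ideal radiator, and the example TOPS/W value are all assumptions chosen for illustration; a real orbital radiator also absorbs sunlight and Earth infrared, which this sketch ignores.

```python
# Illustrative sketch only: ideal radiator sizing via the Stefan-Boltzmann law.
# All parameter defaults below are assumptions, not values from the paper.

SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W / (m^2 K^4)


def radiator_power_w(area_m2, temp_k, emissivity=0.9, sink_temp_k=3.0):
    """Net heat (W) rejected by an ideal single-sided radiator at temp_k."""
    return emissivity * SIGMA * area_m2 * (temp_k**4 - sink_temp_k**4)


def radiator_area_m2(power_w, temp_k, emissivity=0.9, sink_temp_k=3.0):
    """Radiator area (m^2) needed to reject power_w at temperature temp_k."""
    return power_w / (emissivity * SIGMA * (temp_k**4 - sink_temp_k**4))


def permitted_tops(area_m2, temp_k, tops_per_watt, emissivity=0.9, sink_temp_k=3.0):
    """Upper bound on sustained TOPS if all chip power must leave via the radiator."""
    budget_w = radiator_power_w(area_m2, temp_k, emissivity, sink_temp_k)
    return budget_w * tops_per_watt


if __name__ == "__main__":
    # Hypothetical 1 kW accelerator with a radiator held at 330 K.
    print(f"area for 1 kW at 330 K: {radiator_area_m2(1000.0, 330.0):.2f} m^2")
    # Hypothetical efficiency of 2 TOPS/W on a 10 m^2 radiator.
    print(f"permitted TOPS: {permitted_tops(10.0, 330.0, 2.0):.0f}")
```

Because rejected power scales with the fourth power of radiator temperature, a cooler radiator (and therefore a cooler chip) needs disproportionately more area. Under a fixed radiator budget, a higher TOPS/W therefore translates directly into more permitted throughput.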