Thermal management is a major challenge in next-generation high-performance computing systems, particularly for heterogeneous multi-chip packages such as the NVIDIA GB200 Grace Blackwell Superchip. In this work, a physics-based computational framework is developed to optimize embedded cooling channel layouts for high-power multi-chip modules. The model couples steady-state heat conduction with a porous-media representation of coolant transport through the microchannel network and a row-wise coolant energy balance to estimate chip temperature fields. In contrast to conventional designs, the framework adopts an interdigitated cooling architecture parameterized by geometric variables, including channel count, channel width, and channel expansion over chip regions, which enables systematic design exploration. For efficient optimization, a surrogate model approximates the relationship between the geometric parameters and the temperature metrics. The surrogate is then optimized with a mixed-integer quadratic programming (MIQP) algorithm to minimize a weighted objective combining peak and average chip temperatures. To improve physical relevance, channel placement is further constrained to increase cooling coverage near the GPU regions, where thermal loads are highest. The framework is applied to a representative multi-chip configuration based on the NVIDIA GB200 architecture, consisting of two graphics processing units (GPUs) and one central processing unit (CPU). The results show that the optimal design reduces the peak chip temperature by 140.45°C and the average chip temperature by 35.87°C compared to the baseline configuration.
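The surrogate-plus-optimization step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not state the surrogate's functional form, the design-variable ranges, the objective weights, or the MIQP solver used. Here a full quadratic in two hypothetical variables (channel count `n` and channel width `w`) is fitted by least squares, and brute-force enumeration over a small discrete grid stands in for a real MIQP solver.

```python
import itertools

import numpy as np


def quad_features(n, w):
    """Full quadratic basis [1, n, w, n^2, n*w, w^2] (assumed surrogate form)."""
    n = np.asarray(n, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.stack([np.ones_like(n), n, w, n * n, n * w, w * w], axis=-1)


def fit_surrogate(n, w, temps):
    """Least-squares fit of the quadratic surrogate to sampled temperatures."""
    coeffs, *_ = np.linalg.lstsq(quad_features(n, w), temps, rcond=None)
    return coeffs


def weighted_objective(coeffs_peak, coeffs_avg, n, w, alpha=0.5):
    """alpha * T_peak + (1 - alpha) * T_avg; alpha is a hypothetical weight."""
    f = quad_features(n, w)
    return alpha * (f @ coeffs_peak) + (1.0 - alpha) * (f @ coeffs_avg)


def minimize_by_enumeration(coeffs_peak, coeffs_avg, n_values, w_values, alpha=0.5):
    """Exhaustive search over a discrete design grid.

    This is a stand-in for an MIQP solver, practical only for small grids.
    """
    return min(
        itertools.product(n_values, w_values),
        key=lambda p: float(
            weighted_objective(coeffs_peak, coeffs_avg, p[0], p[1], alpha)
        ),
    )
```

In practice, the samples passed to `fit_surrogate` would come from runs of the physics-based model, and the enumeration would be replaced by a dedicated MIQP solver once placement constraints and more design variables are included.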