Multi-GPU systems are becoming increasingly important in high-performance computing (HPC) and cloud infrastructure, providing acceleration for data-intensive applications, including machine learning workloads. These systems consist of multiple GPUs interconnected through high-speed links such as NVIDIA's NVLink. In this work, we explore whether the interconnect on such systems can act as a novel source of leakage, enabling new forms of covert and side-channel attacks. Specifically, we reverse engineer the operation of NVLink and identify two primary sources of leakage: timing variations due to contention, and accessible performance counters that disclose communication patterns. The leakage is visible remotely and even across VM instances in the cloud, enabling potentially dangerous attacks. Building on these observations, we develop two types of covert-channel attacks across two GPUs, achieving a bandwidth of over 70 Kbps with an error rate of 4.78% for the contention channel. We also develop two end-to-end cross-GPU side-channel attacks: application fingerprinting (covering 18 high-performance computing and deep learning applications) and 3D graphics character identification within Blender, a multi-GPU rendering application. These attacks are highly effective, achieving F1 scores of up to 97.78% and 91.56%, respectively. Surprisingly, we also find that the leakage occurs across virtual machines on the Google Cloud Platform (GCP), and we demonstrate a side-channel attack on Blender in that setting, achieving F1 scores exceeding 88%. Finally, we explore potential defenses, such as managing access to the counters and reducing the resolution of the clock, to mitigate the two sources of leakage.