Characterization of GPU TEE Overheads in Distributed Data Parallel ML Training

Confidential computing (CC) based on trusted execution environments (TEEs) is now the most common approach to secure computing in the cloud. NVIDIA's recent introduction of GPU TEEs allows machine learning (ML) models to be trained without leaking model weights or data to the cloud provider. However, the performance implications of using GPU TEEs for ML training are not well characterized. In this work, we present an in-depth characterization of the performance overhead of running distributed data parallel (DDP) ML training inside GPU TEEs. Our study reveals the performance challenges of DDP training within GPU TEEs. DDP aggregates gradients from multiple devices using ring all-reduce, a well-known approach that consists of multiple scatter-reduce and all-gather operations. In a GPU TEE, only the GPU package (the GPU and its HBM) is trusted, so any data communicated outside the GPU package must be encrypted for confidentiality and authenticated for integrity. Each phase of ring all-reduce therefore requires encryption and message authentication code (MAC) generation at the sender, and decryption and MAC verification at the receiver. As the number of GPUs participating in DDP increases, the overhead of secure inter-GPU communication during ring all-reduce grows proportionally. Additionally, larger models trigger more asynchronous all-reduce operations, which further increases the communication cost. Our results show that with four GPU TEEs, the runtime per training iteration increases by 8x on average and by up to 41.6x, depending on the model being trained, compared to DDP training without TEEs.
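
To make the communication pattern concrete, the following is a minimal sketch of ring all-reduce that also counts how many chunk transfers would need to be protected. It is an illustration under stated assumptions, not the paper's measurement setup and not the implementation used by any particular communication library: the function name `ring_all_reduce`, the list-based buffers, and the transfer counter are all hypothetical, and no actual encryption is performed. Each counted transfer stands for one encrypt-and-MAC at the sender plus one decrypt-and-verify at the receiver.

```python
def ring_all_reduce(grads):
    """Simulate ring all-reduce (sum) over n workers.

    grads[i] is worker i's gradient as a list of numbers; its length must be
    divisible by n. Returns (buffers, transfers), where `transfers` counts
    chunk sends that leave a GPU package and would therefore need
    encryption + MAC at the sender and decryption + MAC check at the receiver.
    """
    n = len(grads)
    length = len(grads[0])
    assert n >= 2 and length % n == 0
    chunk = length // n
    bufs = [list(g) for g in grads]  # one private buffer per simulated GPU
    transfers = 0

    def span(c):
        # Index range covered by chunk c.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: scatter-reduce. In step s, rank r sends chunk (r - s) mod n
    # to rank r + 1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        outgoing = [((r - step) % n, bufs[r][span((r - step) % n)])
                    for r in range(n)]  # snapshot: sends are simultaneous
        for r, (c, payload) in enumerate(outgoing):
            dst = (r + 1) % n
            transfers += 1
            local = bufs[dst][span(c)]
            bufs[dst][span(c)] = [a + b for a, b in zip(local, payload)]

    # Phase 2: all-gather. Rank r now owns the fully reduced chunk
    # (r + 1) mod n; in step s it forwards chunk (r + 1 - s) mod n,
    # and the receiver overwrites its copy.
    for step in range(n - 1):
        outgoing = [((r + 1 - step) % n, bufs[r][span((r + 1 - step) % n)])
                    for r in range(n)]
        for r, (c, payload) in enumerate(outgoing):
            dst = (r + 1) % n
            transfers += 1
            bufs[dst][span(c)] = list(payload)

    return bufs, transfers


# Four simulated GPUs, each holding a constant gradient of 8 elements.
example_grads = [[rank + 1] * 8 for rank in range(4)]
reduced, secure_transfers = ring_all_reduce(example_grads)
```

In this sketch, n workers perform 2(n - 1) steps with one send per worker per step, so each worker encrypts and decrypts 2(n - 1) chunks per all-reduce and the ring as a whole performs 2n(n - 1) protected transfers. The counter captures only the number of cryptographic operations, not their cost, which also depends on chunk size and hardware.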
