Heterogeneous collaborative computing with NPUs and CPUs has received widespread attention due to its substantial performance benefits. To ensure data confidentiality and integrity during computation, Trusted Execution Environments (TEEs) are considered a promising solution because of their comparatively low overhead. However, existing heterogeneous TEE designs are inefficient for collaborative computing due to the fine-grained and differing memory-protection granularities of the CPU and NPU: 1) the cacheline granularity of CPU TEEs intensifies memory pressure through extra metadata accesses; 2) the cacheline-granularity MACs of the NPU escalate pressure on its limited memory capacity; and 3) data transfer across heterogeneous enclaves relies on transit through non-secure regions, resulting in cumbersome re-encryption and scheduling. To address these issues, we propose TensorTEE, a unified tensor-granularity heterogeneous TEE for efficient and secure collaborative tensor computing. First, we virtually support tensor granularity in the CPU TEE by detecting and maintaining tensor structures on-chip, eliminating off-chip metadata accesses. Second, we propose tensor-granularity MAC management with predictive execution, which avoids computation stalls while eliminating off-chip MAC storage and accesses. Moreover, building on the unified granularity, we enable direct data transfer without re-encryption or scheduling dilemmas. Our evaluation is built on an enhanced Gem5 and a cycle-accurate NPU simulator. The results show that TensorTEE improves the performance of Large Language Model (LLM) training workloads by 4.0x compared to existing work and incurs only a 2.1% overhead compared to non-secure training, offering a practical security assurance for LLM training.