Discrete GPU accelerators, while providing massive computing power for supercomputers and data centers, have their own separate memory domain. Explicitly managing memory across the device and host domains is tedious and error-prone. To improve programming portability and productivity, Unified Memory (UM) integrates GPU memory into the host virtual memory system, providing transparent data migration between the two domains and supporting GPU memory oversubscription. Nevertheless, current UM technologies cause significant performance loss for applications. With AMD GPUs increasingly being integrated into the world's leading supercomputers, it is necessary to understand their Shared Virtual Memory (SVM) and mitigate its performance impacts. In this work, we delve into the SVM design, examine its interactions with applications' data accesses at fine granularity, quantitatively analyze its performance effects on various applications, and identify the performance bottlenecks. Our research reveals that SVM employs an aggressive prefetching strategy for demand paging. This prefetching is efficient when GPU memory is not oversubscribed. However, in tandem with the eviction policy, it causes excessive thrashing and performance degradation for certain applications under oversubscription. We discuss SVM-aware algorithms and SVM design changes that mitigate these performance impacts. To the best of our knowledge, this work is the first in-depth and comprehensive study of SVM technologies.