An obvious way to alleviate memory difficulties in GPU-based AI computing is CPU offload, where data are moved between GPU and CPU RAM so that inexpensive CPU RAM can increase the amount of storage available. While CPU offload is an obvious idea, it can greatly slow down a computation, due to the relatively slow transfer rate between CPU RAM and GPU RAM. Thus, any system for CPU offload needs to ensure that when such a transfer must happen, no computation is blocked waiting for the transfer to finish. A key challenge when using CPU offload is that memory transfers introduce nondeterminacy into the system: it is not possible to know before runtime when the transfers will finish, and hence what the best order of operations is to avoid blocking. In this paper, we describe TURNIP, a system for running AI computations using CPU offload. The key innovation in TURNIP is the compilation of the AI computation into a dependency graph that gives the TURNIP runtime freedom to run operations such as GPU kernel calls in many different orders; at runtime, TURNIP chooses the best order in response to real-time events.
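To make the dependency-graph idea concrete, the following is a minimal, hypothetical sketch (not TURNIP's actual implementation or API) of an executor that dispatches each operation as soon as all of its prerequisites have completed. Because any ready operation may be chosen next, the execution order can adapt to runtime events such as a memory transfer finishing early or late; the op names and the `execute` callback are illustrative assumptions.

```python
# Illustrative sketch of event-driven dependency-graph execution.
# Each op (e.g. a kernel launch or a CPU<->GPU transfer) runs once
# all of its prerequisites are done; ready ops may run in any order.
from collections import defaultdict

def run_dependency_graph(ops, deps, execute):
    """ops: list of op ids; deps: dict mapping op -> set of prerequisite ops;
    execute(op): callback that performs the op. Returns the order used."""
    remaining = {op: set(deps.get(op, ())) for op in ops}
    dependents = defaultdict(list)
    for op, prereqs in remaining.items():
        for p in prereqs:
            dependents[p].append(op)
    # Ops with no unfinished prerequisites are immediately runnable.
    ready = [op for op, prereqs in remaining.items() if not prereqs]
    order = []
    while ready:
        op = ready.pop()  # any ready op is valid: the order is flexible
        execute(op)
        order.append(op)
        for child in dependents[op]:
            remaining[child].discard(op)
            if not remaining[child]:
                ready.append(child)
    return order
```

In a real runtime, `execute` would launch asynchronous GPU work and completion events (rather than a synchronous loop) would move newly unblocked operations onto the ready set; the flexibility shown here is what lets the scheduler avoid blocking on slow CPU-GPU transfers.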