Many applications exploit GPUs through tools that attempt to extract the performance of the GPU's parallel architecture automatically. Directive-based programming models such as OpenACC are one such approach: they enable parallel computing simply by adding annotations to loops. Such abstract models, however, often prevent programmers from applying additional low-level optimizations that exploit advanced architectural features of GPUs, because the generated computation is hidden from the application developer. This paper describes and implements a novel, flexible optimization technique that inserts a code emulation phase at the tail end of the compilation pipeline. Our tool emulates the generated code using symbolic analysis to stand in for dynamic information, which allows further low-level code optimizations to be applied. Our implementation accepts both CUDA and OpenACC directives as frontends to the compilation pipeline, enabling low-level GPU optimizations for OpenACC that were not previously possible. We demonstrate the tool's capabilities by automating the use of warp-level shuffle instructions, which are difficult even for advanced GPU programmers to use. Finally, we evaluate the tool on a benchmark suite and a complex application code, and present a detailed study of the benefits of shuffle instructions across four generations of GPU architectures.
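To make the target optimization concrete, the following is a minimal host-side sketch in plain C++ of what a warp-level shuffle-down reduction computes. It is not the tool described above and not its generated code. It models the documented behavior of CUDA's `__shfl_down_sync` on an array of 32 per-lane values. The names `Warp`, `kWarpSize`, `shfl_down`, and `warp_reduce_sum` are hypothetical and introduced only for illustration.

```cpp
#include <array>
#include <cstddef>

// Assumption: a warp of 32 lanes, each holding one int register value.
constexpr std::size_t kWarpSize = 32;
using Warp = std::array<int, kWarpSize>;

// Host-side model of a shuffle-down: lane i reads the value held by lane i + delta.
// Lanes whose source would fall outside the warp keep their own value,
// which mirrors the behavior of CUDA's __shfl_down_sync.
Warp shfl_down(const Warp& regs, std::size_t delta) {
    Warp out = regs;
    for (std::size_t lane = 0; lane + delta < kWarpSize; ++lane) {
        out[lane] = regs[lane + delta];
    }
    return out;
}

// Tree reduction over the warp: after log2(32) = 5 shuffle steps,
// lane 0 holds the sum of all 32 input values. Other lanes hold partial
// or meaningless sums, so only lane 0 is returned.
int warp_reduce_sum(Warp regs) {
    for (std::size_t delta = kWarpSize / 2; delta > 0; delta /= 2) {
        const Warp shifted = shfl_down(regs, delta);
        for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
            regs[lane] += shifted[lane];
        }
    }
    return regs[0];
}
```

On a GPU, each loop over `lane` corresponds to one instruction executed by all threads of the warp at once, and the exchange happens through registers rather than shared memory. Writing the equivalent device code by hand requires reasoning about lane indices and participation masks, which is one reason such instructions are hard to apply manually.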