Design of a GPU with Heterogeneous Cores for Graphics

Heterogeneous architectures can deliver higher performance and energy efficiency than their symmetric counterparts by combining multiple architectures, each tuned to a different type of workload. While previous work has focused on CPUs, this work extends the concept of heterogeneity to GPUs by proposing KHEPRI, a heterogeneous GPU architecture for graphics applications. Scenes in graphics applications are diverse, as they consist of many objects with varying levels of complexity. As a result, computational intensity and memory bandwidth requirements differ significantly across the regions of each scene. To address this variability, our proposal includes two types of cores: cores optimized for high ILP (compute-specialized) and cores that tolerate a higher number of simultaneously outstanding cache misses (memory-specialized). A key component of the proposed architecture is a novel work scheduler that dynamically assigns each part of a frame (i.e., a tile) to the most suitable core. Designing this scheduler is particularly challenging, as it must preserve data locality; otherwise, the benefits of heterogeneity may be offset by the penalty of additional cache misses. Additionally, the scheduler requires knowledge of each tile's characteristics before rendering it. For this purpose, KHEPRI leverages frame-to-frame coherence to predict the behavior of each tile based on that of the corresponding tile in the previous frame. Evaluations across a wide range of commercial animated graphics applications show that, compared to a traditional homogeneous GPU, KHEPRI achieves an average performance improvement of 9.2%, a throughput increase (frames per second) of 7.3%, and a total GPU energy reduction of 4.8%. Importantly, these benefits are achieved without any hardware overhead.
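
To make the prediction step concrete, the following is a minimal sketch of how a tile could be steered to a core type using statistics from the corresponding tile in the previous frame. It is an illustration under stated assumptions, not KHEPRI's actual scheduler: the `TileStats` fields, the misses-per-kilo-instruction metric, and the threshold value are all hypothetical, and the data-locality constraint described above is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

TileId = Tuple[int, int]  # (column, row) of a tile within the frame

@dataclass
class TileStats:
    """Per-tile counters recorded while rendering the previous frame (assumed)."""
    instructions: int
    cache_misses: int

# Hypothetical cutoff; the abstract does not state how tiles are classified.
MPKI_THRESHOLD = 20.0

def predict_core_type(prev: TileStats) -> str:
    """Guess the best core type for a tile from its previous-frame behavior."""
    if prev.instructions == 0:
        # No history for this tile: fall back to a compute-specialized core.
        return "compute"
    misses_per_kilo_instr = 1000.0 * prev.cache_misses / prev.instructions
    return "memory" if misses_per_kilo_instr > MPKI_THRESHOLD else "compute"

def schedule_frame(prev_frame: Dict[TileId, TileStats]) -> Dict[TileId, str]:
    """Map every tile of the next frame to a predicted core type.

    Relies on frame-to-frame coherence: a tile is assumed to behave
    like the tile at the same position in the previous frame.
    """
    return {tile: predict_core_type(stats) for tile, stats in prev_frame.items()}
```

A real scheduler would also have to weigh which core already holds a tile's neighboring data, since ignoring locality could cancel the gains from matching tiles to cores.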
