One-sided dense matrix decompositions (e.g., Cholesky, LU, and QR) are key components of scientific computing across many fields. Although their implementations have been highly optimized for modern processors, they still consume a considerable amount of energy. Because CPU-GPU heterogeneous systems are commonly used for matrix decompositions, this work aims to further improve the energy savings of one-sided matrix decompositions on such systems. We first build an Algorithm-Based Fault Tolerance protected overclocking technique (ABFT-OC) that lets us reliably overclock key matrix decomposition operations. We then design an energy-saving matrix decomposition framework, Bi-directional Slack Reclamation (BSR), that intelligently combines the capabilities of ABFT-OC and dynamic voltage and frequency scaling (DVFS) to maximize energy savings while maintaining performance and reliability. Experiments show that BSR saves up to 11.7% more energy than the current best energy-saving optimization approach with no performance degradation, and reduces Energy * Delay^2 by up to 14.1%. BSR also enables a Pareto-efficient performance-energy trade-off that provides up to 1.43x performance improvement at no extra energy cost.
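To illustrate the general idea behind ABFT, the following minimal sketch shows a checksum test for a matrix product, the kind of operation that dominates the trailing-matrix updates in blocked decompositions. It is a generic illustration under stated assumptions, not the ABFT-OC design itself: the function names (`matmul`, `column_sums`, `abft_check`), the tolerance, and the injected error are all hypothetical. The check relies on the identity e^T(AB) = (e^T A)B, where e is the all-ones vector, so a result corrupted by an unreliable clock setting can be detected without recomputing the full product.

```python
def matmul(a, b):
    """Plain product of two matrices stored as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def column_sums(mat):
    """Sum of each column, i.e. the checksum row e^T * mat."""
    return [sum(col) for col in zip(*mat)]


def abft_check(a, b, c, tol=1e-9):
    """Return True if c is consistent with a * b under the column checksum test.

    Compares (e^T a) * b against e^T c, which costs one vector-matrix
    product instead of a second full matrix product.
    """
    expected = matmul([column_sums(a)], b)[0]
    actual = column_sums(c)
    return all(abs(x - y) <= tol * max(1.0, abs(x))
               for x, y in zip(expected, actual))


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

c = matmul(a, b)
ok_clean = abft_check(a, b, c)          # checksums agree for a correct result

c_faulty = [row[:] for row in c]
c_faulty[1][0] += 0.5                   # hypothetical silent error in one element
ok_faulty = abft_check(a, b, c_faulty)  # checksum mismatch exposes the error
```

A practical scheme would also need to locate and correct the faulty element (for example by adding row checksums) and to choose a tolerance that accounts for floating-point round-off; both are omitted here for brevity.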