This study explores automatic BLAS offloading and INT8-based emulation for accelerating traditional HPC workloads on modern GPU architectures. Using low-bitwidth integer units and a cache-coherent Unified Memory Architecture, we emulate double-precision matrix multiplications in the MuST application without code changes. We find that accuracy depends on both the arithmetic precision and the properties of the operator, a dependence that tunable-precision emulation can accommodate. Unlike traditional mixed-precision approaches, this method preserves the original algorithms while optimizing hardware utilization. We showcase the potential to improve accuracy and performance at the same time. This work highlights the potential of AI-driven hardware to transform HPC and advocates adaptive precision strategies in future scientific computing.
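To make the idea of tunable-precision INT8 emulation concrete, the following is a minimal sketch and not the implementation used in this work. The abstract does not specify the emulation scheme, so the sketch assumes a slice-based splitting in the spirit of the Ozaki scheme: each double-precision matrix is scaled by powers of two and cut into INT8 slices, the slices are multiplied with exact integer accumulation, and the partial products are recombined in double precision. The names `split_int8`, `emulated_dgemm`, `SLICE_BITS`, and `num_slices` are hypothetical, and NumPy on the CPU stands in for the GPU integer units.

```python
import numpy as np

SLICE_BITS = 6  # assumed slice width; |slice value| <= 63 fits in int8


def split_int8(x, num_slices):
    """Split x (all |entries| < 1) into int8 slices such that
    x ~= sum_s slices[s] * 2**(-SLICE_BITS * (s + 1))."""
    slices = []
    r = x.copy()
    for _ in range(num_slices):
        r = r * 2.0**SLICE_BITS   # power-of-two scaling is exact
        q = np.trunc(r)           # integer part, magnitude at most 63
        slices.append(q.astype(np.int8))
        r = r - q                 # exact fractional remainder, |r| < 1
    return slices


def emulated_dgemm(A, B, num_slices):
    """Approximate A @ B using only int8 operands and int32 accumulation.
    num_slices is the precision knob: more slices, smaller truncation error."""
    # Power-of-two exponents so each row of A and column of B has |entries| < 1.
    _, ea = np.frexp(np.abs(A).max(axis=1, keepdims=True))
    _, eb = np.frexp(np.abs(B).max(axis=0, keepdims=True))
    A_slices = split_int8(np.ldexp(A, -ea), num_slices)
    B_slices = split_int8(np.ldexp(B, -eb), num_slices)

    C = np.zeros((A.shape[0], B.shape[1]))
    for s, a in enumerate(A_slices):
        for t, b in enumerate(B_slices):
            # Widen to int32 so the integer product is accumulated exactly,
            # mimicking INT8 inputs with INT32 accumulators.
            P = a.astype(np.int32) @ b.astype(np.int32)
            C += P * 2.0**(-SLICE_BITS * (s + t + 2))
    # Undo the row and column scaling.
    return np.ldexp(C, ea + eb)


def relative_error(A, B, num_slices):
    ref = A @ B
    return np.abs(emulated_dgemm(A, B, num_slices) - ref).max() / np.abs(ref).max()


rng = np.random.default_rng(0)
A = rng.standard_normal((32, 48))
B = rng.standard_normal((48, 16))
for n in (2, 4, 6, 9):
    print(n, relative_error(A, B, n))
```

In this sketch the cost grows with the square of `num_slices`, since every pair of slices is multiplied, while the truncation error shrinks by roughly a factor of 2^6 per added slice. That trade-off is the sense in which the precision is tunable: a well-conditioned operator may tolerate few slices, and a sensitive one can be given more.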