We describe the Bandicoot GPU linear algebra toolkit, a C++-based library that prioritises ease of use without compromising efficiency. Bandicoot's API is compatible with the popular Armadillo CPU linear algebra library, easing the transition for existing CPU-based codebases. Unlike other GPU-focused toolkits, Bandicoot uses template metaprogramming to generate fused GPU kernels directly at compile time, yielding efficient kernels that can often saturate memory bandwidth. This avoids runtime overhead and removes the need for JIT infrastructure. Empirical results show that Bandicoot outperforms, sometimes by considerable margins, commonly used linear algebra toolkits including PyTorch, TensorFlow, and JAX.
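To make the idea of compile-time kernel fusion concrete, the following is a minimal CPU-only sketch of expression templates, the general C++ technique on which this kind of fusion is commonly built. It is not Bandicoot's implementation: the names `sketch::Expr`, `sketch::Sum`, and `sketch::Vec` are hypothetical, and a plain loop stands in for the single fused GPU kernel that a GPU toolkit would emit instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical names, for illustration only

// CRTP base: marks a type as an expression so operator+ only matches our types.
template <typename D>
struct Expr {
    const D& self() const { return static_cast<const D&>(*this); }
};

// Node describing "lhs + rhs". Building it does no arithmetic;
// the whole expression tree is encoded in the type at compile time.
template <typename L, typename R>
struct Sum : Expr<Sum<L, R>> {
    const L& lhs;
    const R& rhs;
    Sum(const L& l, const R& r) : lhs(l), rhs(r) {}
    float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vec : Expr<Vec> {
    std::vector<float> data;
    explicit Vec(std::size_t n, float v = 0.0f) : data(n, v) {}
    float operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assignment from any expression evaluates the whole tree in one pass,
    // with no temporary vectors. This loop is the CPU stand-in for a
    // single fused kernel.
    template <typename D>
    Vec& operator=(const Expr<D>& e) {
        const D& expr = e.self();
        assert(expr.size() == data.size());
        for (std::size_t i = 0; i < data.size(); ++i) {
            data[i] = expr[i];
        }
        return *this;
    }
};

// a + b returns a lightweight Sum node rather than a computed vector.
template <typename L, typename R>
Sum<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Sum<L, R>(l.self(), r.self());
}

}  // namespace sketch
```

With these definitions, `out = a + b + c;` has the static type `Sum<Sum<Vec, Vec>, Vec>` on the right-hand side and is evaluated in a single traversal of the data, which is why no runtime graph construction or JIT step is needed to fuse the operations.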