We present Nautilus, a novel tensor compiler that moves toward fully automated math-to-kernel optimization. Nautilus compiles a high-level algebraic specification of tensor operators into efficient tiled GPU kernels. Nautilus's successive lowering design allows high-level optimizations, expression rewrites, and tile optimizations to be jointly applied in a single end-to-end system. Nautilus presents a novel auto-scheduler that discovers sequences of high-level optimizations, while preserving the regular program structure needed by tile optimizers. Nautilus's auto-scheduler captures complex interactions and trade-offs in the high-level optimizations, including aggressive global transformations like advanced reduction fusion. Nautilus is the first end-to-end tensor compiler capable of starting from a math-like description of attention and automatically discovering FlashAttention-3-like kernels, offloading the entire burden of optimization from the programmer to the compiler. Across five transformer-based models and 150 evaluation configurations on NVIDIA GH200 and RTX 5090 GPUs, Nautilus achieves up to 23% higher throughput than state-of-the-art compilers on GH200 and up to 42% on RTX 5090, while matching or exceeding manually written cuDNN kernels on many long-sequence configurations.
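To make the phrase "reduction fusion" concrete, the sketch below contrasts a math-like specification of attention for a single query row with a fused, tiled version that keeps a running maximum, running normalizer, and running output (the online-softmax reformulation that FlashAttention-style kernels rely on). This is an illustration only, not Nautilus's input language, scheduler, or generated code; the function names `attention_reference` and `attention_fused` and the `tile` parameter are hypothetical.

```python
import math


def attention_reference(q, K, V):
    """Math-like spec for one query row: softmax(q . K^T / sqrt(d)) . V."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)                                # reduction 1: row maximum
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)                           # reduction 2: normalizer
    # reduction 3: weighted sum over the rows of V
    return [sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))]


def attention_fused(q, K, V, tile=2):
    """Same result in one pass over K/V tiles, using online softmax."""
    d = len(q)
    m = -math.inf                  # running maximum of the scores seen so far
    l = 0.0                        # running softmax normalizer
    acc = [0.0] * len(V[0])        # running unnormalized output
    for start in range(0, len(K), tile):
        k_tile = K[start:start + tile]
        v_tile = V[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)   # rescales earlier partial results
        p = [math.exp(s - m_new) for s in scores]
        l = l * scale + sum(p)
        acc = [a * scale + sum(pi * v[j] for pi, v in zip(p, v_tile))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]


# Small hypothetical inputs: 5 keys/values, head dimension 4, value dimension 3.
q = [0.1, -0.2, 0.3, 0.4]
K = [[0.5, 0.1, -0.3, 0.2], [0.0, 0.4, 0.1, -0.1], [1.0, -0.5, 0.2, 0.3],
     [-0.2, 0.3, 0.6, 0.0], [0.7, 0.7, -0.7, 0.1]]
V = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [0.0, 2.0, 1.0],
     [3.0, -1.0, 0.5], [1.0, 1.0, 1.0]]

ref = attention_reference(q, K, V)
fused = attention_fused(q, K, V, tile=2)
```

The reference version makes three separate passes over the scores, while the fused version touches each tile of `K` and `V` once and never materializes the full score row. The abstract's claim is that Nautilus's auto-scheduler discovers transformations of this kind automatically from the math-like form.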