An $N$-point FFT admits many valid implementations that differ in radix choice, stage ordering, and register-blocking strategy. These alternatives use different SIMD instruction mixes with different latencies, yet produce the same mathematical result. We show that finding the fastest implementation is a shortest-path problem on a directed acyclic graph. We formalize two variants of this graph. In the \emph{context-free} model, nodes represent computation stages and edge weights are independently measured instruction costs. In the \emph{context-aware} model, nodes are expanded to encode the \emph{predecessor edge type}, so that edge weights capture inter-operation correlations such as cache warming -- the cost of operation~B depends on which operation~A preceded it. This addresses a limitation identified but deliberately bypassed by FFTW \citep{FrigoJohnson1998}: that optimal-substructure assumptions break down ``because of the different states of the cache.'' Applied to Apple M1 NEON, the context-free Dijkstra finds an arrangement at 22.1~GFLOPS (74\% of optimal). The context-aware Dijkstra discovers $\text{R4} \to \text{R2} \to \text{R4} \to \text{R4} \to \text{Fused-8}$ at 29.8~GFLOPS -- a $5.2\times$ improvement over pure radix-2 and 34\% faster than the context-free result. This arrangement includes a radix-2 pass \emph{sandwiched between} radix-4 passes, exploiting cache residuals that only exist in context. No context-free search can discover this.
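To make the two graph models concrete, the following is a minimal sketch, not the implementation evaluated above. It assumes a toy instance with only two operation types and invented integer costs; the names \texttt{ADVANCE}, \texttt{BASE\_COST}, \texttt{CONTEXT\_COST}, and \texttt{shortest\_plan} are hypothetical. In the context-free mode a node is the number of completed stages; in the context-aware mode a node is the pair (completed stages, predecessor operation), so an edge weight may depend on what ran before.

```python
import heapq

# Hypothetical operation set: name -> number of radix-2 stages it completes.
ADVANCE = {"R2": 1, "R4": 2}
# Hypothetical context-free costs (arbitrary integer units, not measurements).
BASE_COST = {"R2": 10, "R4": 18}
# Hypothetical context-dependent costs: (predecessor op, op) -> cost.
# Here an R2 pass is assumed cheaper when it directly follows an R4 pass.
CONTEXT_COST = {("R4", "R2"): 5}


def edge_cost(prev, op, context_aware):
    """Cost of running `op` after `prev` under the chosen model."""
    if context_aware and (prev, op) in CONTEXT_COST:
        return CONTEXT_COST[(prev, op)]
    return BASE_COST[op]


def shortest_plan(total_stages, context_aware):
    """Dijkstra over (stages_done, predecessor) states.

    In context-free mode the predecessor slot is always None, so the
    state space collapses to the number of completed stages.
    Returns (total cost, list of operation names).
    """
    start = (0, None)
    dist = {start: 0}
    parent = {}  # state -> (previous state, op taken to reach it)
    counter = 0  # unique tie-breaker so heap never compares states
    heap = [(0, counter, start)]
    while heap:
        d, _, state = heapq.heappop(heap)
        if d > dist[state]:
            continue  # stale heap entry
        done, prev = state
        if done == total_stages:
            plan = []
            s = state
            while s in parent:
                s, op = parent[s]
                plan.append(op)
            plan.reverse()
            return d, plan
        for op, adv in ADVANCE.items():
            if done + adv > total_stages:
                continue
            nxt = (done + adv, op if context_aware else None)
            nd = d + edge_cost(prev, op, context_aware)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                parent[nxt] = (state, op)
                counter += 1
                heapq.heappush(heap, (nd, counter, nxt))
    raise ValueError("no sequence of operations completes the transform")


if __name__ == "__main__":
    print(shortest_plan(3, context_aware=False))
    print(shortest_plan(3, context_aware=True))
```

With these invented costs, the context-free search cannot see the discounted R2 edge and settles for a more expensive plan, while the context-aware search selects R4 followed by R2. Expanding the nodes multiplies the state count by the number of operation types, which is the price paid for letting edge weights depend on the predecessor.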