Mapping parallel threads onto non-box-shaped domains is a known challenge in GPU computing: an efficient mapping avoids the performance penalties caused by unnecessary resource allocation. Currently, achieving this requires significant analytical effort, because a bespoke mapping function must be derived by hand for each geometry. This work introduces a novel approach that leverages the symbolic reasoning of Large Language Models (LLMs) to automate this derivation entirely through in-context learning. Focusing on state-of-the-art open-weight models, we conduct a rigorous comparative analysis across spatial domains of increasing complexity. Our results demonstrate that modern local LLMs successfully infer exact O(1) and O(log N) mapping equations for complex 2D/3D dense domains and 2D fractals, vastly outperforming traditional symbolic regression methods. Crucially, we profile the energy viability of this approach on high-performance infrastructure, distinguishing between the code-generation and execution phases. Although inference incurs a high energy cost, particularly for reasoning-focused models such as DeepSeek-R1, it is a one-time upfront investment. Once integrated, the generated analytical kernels eliminate block waste entirely, yielding large energy and time savings (e.g., up to 4833x speedup and 2890x energy reduction) during actual GPU workloads. Finally, we identify a current "reasoning ceiling" when these models face highly recursive 3D fractals (e.g., the Menger Sponge). This limitation benchmarks the present maturity of open-weight architectures and charts a viable path toward fully automated, energy-efficient GPU resource optimization.
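To make the notion of an exact O(1) mapping concrete, the following is a minimal sketch for one simple dense 2D domain, the lower triangle of an n x n grid. It uses the classic closed-form triangular-number inversion, not an equation produced by the models discussed above. The names `tri_map` and `block_counts` are hypothetical, and the code is plain Python standing in for the index arithmetic a GPU kernel would perform per block.

```python
import math


def tri_map(k: int) -> tuple[int, int]:
    """Map a linear block index k to (row, col) in a lower-triangular domain.

    Illustrative assumption: blocks are enumerated row by row, with row r
    holding r + 1 blocks (diagonal included). The row is recovered in O(1)
    by inverting the triangular number r * (r + 1) / 2 <= k.
    """
    # Integer square root keeps the result exact for large k.
    row = (math.isqrt(8 * k + 1) - 1) // 2
    col = k - row * (row + 1) // 2
    return row, col


def block_counts(n: int) -> tuple[int, int]:
    """Return (bounding-box blocks, exactly mapped blocks) for side length n."""
    return n * n, n * (n + 1) // 2


n = 4
coords = [tri_map(k) for k in range(n * (n + 1) // 2)]
box, exact = block_counts(n)
print(coords)
print(f"bounding box: {box} blocks, exact mapping: {exact} blocks")
```

With a bounding-box launch, roughly half of the blocks fall outside the triangle and exit immediately. Launching only n(n+1)/2 blocks and recovering the coordinates analytically avoids that waste. The domains named in the abstract, such as 3D dense shapes and fractals, require more involved expressions than this one.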