LUMINA: LLM-Guided GPU Architecture Exploration via Bottleneck Analysis

GPU design space exploration (DSE) for modern AI workloads, such as large language model (LLM) inference, is challenging because of GPUs' vast, multi-modal design spaces, high simulation costs, and complex design optimization objectives (e.g., performance, power, and area trade-offs). Existing automated DSE methods are often prohibitively expensive: they either require an excessive number of exploration samples or depend on intricate, manually crafted analyses of interdependent critical paths guided by human heuristics. We present LUMINA, an LLM-driven GPU architecture exploration framework that leverages AI to improve the efficiency and efficacy of DSE for GPUs. LUMINA extracts architectural knowledge from simulator code and performs sensitivity studies to automatically compose DSE rules, which are auto-corrected during exploration. A core component of LUMINA is a DSE Benchmark that comprehensively evaluates and enhances LLMs' capabilities across three fundamental skills required for architecture optimization, providing a principled and reproducible basis for selecting models and ensuring consistent architectural reasoning. In a design space with 4.7 million possible samples, LUMINA identifies 6 designs with better performance and area than an A100 GPU in only 20 steps via LLM-assisted bottleneck analysis. Compared with machine-learning baselines, LUMINA achieves 17.5x higher design space exploration efficiency and 32.9% better designs (measured by Pareto hypervolume), showcasing its ability to deliver high-quality design guidance with minimal search cost.
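To make the Pareto hypervolume metric concrete, the following is a minimal sketch of how it can be computed for two objectives. It is an illustration only, not LUMINA's implementation: the choice of two minimized objectives (here labeled latency and area), the reference point, the sample values, and the function names `pareto_front` and `hypervolume_2d` are all assumptions introduced for this example.

```python
from typing import List, Tuple

# Hypothetical objective pair: (latency, area), both to be minimized.
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by the first objective."""
    front: List[Point] = []
    best_second = float("inf")
    # After sorting, a point is non-dominated only if it strictly improves
    # the second objective over every point that precedes it.
    for p in sorted(set(points)):
        if p[1] < best_second:
            front.append(p)
            best_second = p[1]
    return front


def hypervolume_2d(points: List[Point], ref: Point) -> float:
    """Area dominated by the Pareto front and bounded by the reference point."""
    # Points that are not strictly better than the reference contribute nothing.
    front = [p for p in pareto_front(points) if p[0] < ref[0] and p[1] < ref[1]]
    volume = 0.0
    prev_second = ref[1]
    # Sweep the front in order of the first objective, adding one
    # horizontal slab per point.
    for first, second in front:
        volume += (ref[0] - first) * (prev_second - second)
        prev_second = second
    return volume


# Made-up design points; (3.0, 3.0) is dominated and does not contribute.
designs = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0)]
hv = hypervolume_2d(designs, ref=(4.0, 4.0))  # 6.0
```

A larger hypervolume means the set of explored designs covers more of the objective space relative to the reference point, so two exploration methods can be compared by the hypervolume each reaches under the same reference point and sample budget.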
