Deploying large language models (LLMs) on heterogeneous edge devices demands frameworks that jointly optimize energy efficiency, inference quality, and reliability. Our prior QEIL v1 (Kumar & Jha, 2026) achieved a 4.82x IPW improvement but relied on static efficiency factors, greedy optimization, and unverified candidate selection. QEIL v2 replaces every static heuristic with physics-grounded, runtime-adaptive models. We introduce three device-workload metrics: DASI (roofline-derived compute utilization), CPQ (memory pressure from allocation theory), and Phi (thermal yield from CMOS leakage physics). Together they form a unified energy equation in which every coefficient is traceable to semiconductor physics. For optimization, PGSAM (Pareto-Guided Simulated Annealing with Momentum) simultaneously minimizes energy, latency, and device underutilization. At inference time, the EAC/ARDE selection cascade with CSVET early stopping provides progressive verification among repeated samples. Evaluated on WikiText-103, GSM8K, and ARC-Challenge across seven model families (125M-8B parameters, including one pre-quantized variant), QEIL v2 achieves 75.7% pass@k at 63.8W (IPW=0.9749), a 2.86x improvement over standard inference. When applied to a 4-bit Llama-3.1-8B, QEIL v2's physics-grounded routing achieves IPW=1.024 at 54.8W, making it the first edge orchestration system to surpass the IPW=1.0 empirical reference mark; the gain is attributable entirely to QEIL v2's workload-adaptive device allocation on a model with reduced memory bandwidth requirements. Relative to standard inference, total energy drops 75.6% and latency drops 38.3%, with zero thermal throttling and 100% fault recovery across all benchmarks and model families.
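
As a rough illustration of the roofline reasoning that a metric such as DASI builds on, the sketch below computes the fraction of peak compute a device can reach at a given arithmetic intensity. It is not the DASI definition, which the abstract does not give. The function names and the device figures (10 TFLOP/s peak, 100 GB/s bandwidth) are hypothetical and chosen only for the example.

```python
def roofline_attainable_flops(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Attainable FLOP/s under the classic roofline model.

    peak_flops           -- device peak compute, FLOP/s
    mem_bandwidth        -- device memory bandwidth, bytes/s
    arithmetic_intensity -- workload FLOPs per byte moved
    """
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def roofline_utilization(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Fraction of peak compute reachable at this intensity, in [0, 1]."""
    attainable = roofline_attainable_flops(
        peak_flops, mem_bandwidth, arithmetic_intensity
    )
    return attainable / peak_flops


# Hypothetical edge accelerator (assumed values, not taken from the paper).
PEAK_FLOPS = 10e12      # 10 TFLOP/s
MEM_BANDWIDTH = 100e9   # 100 GB/s

# Low-intensity workload (e.g. token-by-token decoding): memory-bound.
decode_util = roofline_utilization(PEAK_FLOPS, MEM_BANDWIDTH, 2.0)

# High-intensity workload (e.g. batched prefill): compute-bound.
prefill_util = roofline_utilization(PEAK_FLOPS, MEM_BANDWIDTH, 200.0)

print(f"decode utilization:  {decode_util:.2f}")
print(f"prefill utilization: {prefill_util:.2f}")
```

On this assumed device the memory-bound workload reaches only a small fraction of peak, which is the kind of device-workload mismatch that a utilization-aware router would try to avoid.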