QEIL v2: Heterogeneous Computing for Edge Intelligence via Roofline-Derived Pareto-Optimal Energy Modeling and Multi-Objective Orchestration

Deploying large language models (LLMs) on heterogeneous edge devices demands frameworks that jointly optimize energy efficiency, inference quality, and reliability. Our prior QEIL v1 (Kumar & Jha, 2026) achieved a 4.82x IPW improvement but relied on static efficiency factors, greedy optimization, and unverified candidate selection. QEIL v2 replaces every static heuristic with physics-grounded, runtime-adaptive models. We introduce three device-workload metrics: DASI (roofline-derived compute utilization), CPQ (memory pressure from allocation theory), and Phi (thermal yield from CMOS leakage physics). Together they form a unified energy equation in which every coefficient is traceable to semiconductor physics. For optimization, PGSAM (Pareto-Guided Simulated Annealing with Momentum) simultaneously minimizes energy, latency, and device underutilization. At inference time, the EAC/ARDE selection cascade with CSVET early stopping provides progressive verification among repeated samples. Evaluated on WikiText-103, GSM8K, and ARC-Challenge across seven model families (125M-8B parameters, including one pre-quantized variant), QEIL v2 achieves 75.7% pass@k at 63.8W (IPW=0.9749), a 2.86x improvement over standard inference. Applied to a 4-bit Llama-3.1-8B, QEIL v2's physics-grounded routing achieves IPW=1.024 at 54.8W, making it the first edge orchestration system to surpass the IPW=1.0 empirical reference mark; the gain is attributable entirely to QEIL v2's workload-adaptive device allocation on a model with reduced memory bandwidth requirements. Relative to standard inference, total energy drops 75.6% and latency drops 38.3%, with zero thermal throttling and 100% fault recovery across all benchmarks and model families.
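
For readers unfamiliar with the roofline model that DASI is derived from, the following is a minimal sketch of the standard roofline bound only. It is not the paper's DASI definition, which the abstract does not give. The function names and the device numbers are hypothetical and chosen purely for illustration.

```python
# Standard roofline bound: a kernel's attainable throughput is capped either
# by the device's peak compute rate or by how fast memory can feed it.
# NOTE: this is the textbook roofline model, not the paper's DASI formula.

def attainable_flops(peak_flops: float, mem_bandwidth: float,
                     arithmetic_intensity: float) -> float:
    """Attainable FLOP/s given peak FLOP/s, memory bandwidth (bytes/s),
    and arithmetic intensity (FLOPs per byte moved)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def roofline_utilization(peak_flops: float, mem_bandwidth: float,
                         arithmetic_intensity: float) -> float:
    """Fraction of peak compute the roofline bound allows, in [0, 1]."""
    return attainable_flops(peak_flops, mem_bandwidth,
                            arithmetic_intensity) / peak_flops


# Hypothetical device: 10 TFLOP/s peak, 100 GB/s memory bandwidth.
PEAK = 10e12
BANDWIDTH = 100e9

# Memory-bound workload (2 FLOPs per byte): bandwidth is the limit.
low = roofline_utilization(PEAK, BANDWIDTH, 2.0)

# Compute-bound workload (200 FLOPs per byte): peak compute is the limit.
high = roofline_utilization(PEAK, BANDWIDTH, 200.0)

print(f"memory-bound utilization:  {low:.2f}")
print(f"compute-bound utilization: {high:.2f}")
```

Under this bound, a workload that moves fewer bytes per FLOP sits further to the right on the roofline. That is consistent with the abstract's remark that the 4-bit model has reduced memory bandwidth requirements, although the exact mechanism QEIL v2 uses is not stated here.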
