Deploying large language models (LLMs) on resource-constrained edge devices is limited by a poor understanding of inference-time scaling on heterogeneous hardware. We present QEIL (Quantifying Edge Intelligence via Inference-time Scaling Formalisms), a unified framework for characterizing and optimizing inference across CPUs, GPUs, and NPUs. QEIL reveals stable power-law scaling behavior in latency, energy, and task coverage for transformer models ranging from 125M to 2.6B parameters, and demonstrates that heterogeneous orchestration, with intelligent coordination across mixed accelerators, consistently improves energy efficiency and coverage over homogeneous execution. QEIL introduces three composite metrics: Intelligence per Watt, Energy-Coverage Efficiency, and Price-Power-Performance, enabling multi-objective optimization for edge intelligence. A safety-first agentic orchestrator dynamically allocates workloads across same-vendor and cross-vendor accelerators while enforcing thermal constraints, fault-tolerant execution, adversarial input validation, and continuous hardware health monitoring. Evaluations across five model families show that QEIL achieves consistent improvements in efficiency, latency, and coverage without sacrificing accuracy or system safety, establishing inference-time scaling and heterogeneous orchestration as key foundations for reliable edge AI.
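To make the scaling and metric formalisms named above concrete, a minimal sketch follows. The functional forms and all symbols (a, \alpha, N, A, P, E, C, and the dollar cost term) are illustrative assumptions, not the paper's exact definitions.

% Hypothetical sketch: an assumed power-law latency model and plausible
% forms for the three composite metrics; all symbols are illustrative.
\begin{align}
  T(N) &\approx a \, N^{\alpha}
    && \text{latency vs. parameter count } N \text{ (assumed power law)} \\
  \mathrm{IPW} &= \frac{A}{P}
    && \text{Intelligence per Watt: task accuracy } A \text{ per watt } P \\
  \mathrm{ECE} &= \frac{C}{E}
    && \text{Energy-Coverage Efficiency: task coverage } C \text{ per joule } E \\
  \mathrm{PPP} &= \frac{A}{\mathrm{cost} \cdot P}
    && \text{Price-Power-Performance: accuracy per dollar-watt}
\end{align}

Under the assumed power law, fitting \log T against \log N yields the exponent \alpha as the slope, which is how such scaling behavior is typically estimated across model sizes.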