Sustainability Is Not Linear: Quantifying Performance, Energy, and Privacy Trade-offs in On-Device Intelligence

The migration of Large Language Models (LLMs) from cloud clusters to edge devices promises enhanced privacy and offline accessibility, but this transition runs into hard physical limits: battery capacity, thermal headroom, and, most importantly, memory. To navigate this landscape, we construct a reproducible experimental pipeline that profiles the interplay between energy consumption, latency, and generation quality. Unlike purely theoretical studies, we capture granular power metrics across eight models ranging from 0.5B to 9B parameters without requiring root access, so that our findings reflect realistic user conditions. We use this pipeline to conduct an empirical case study on a flagship Android device, the Samsung Galaxy S25 Ultra, and establish foundational hypotheses about the trade-offs between generation quality, performance, and resource consumption. Our investigation uncovers a counter-intuitive quantization-energy paradox: while modern importance-aware quantization successfully reduces memory footprints enough to fit larger models into RAM, it yields negligible energy savings compared to standard mixed-precision methods. This suggests that, for battery life, the architecture of the model, not its quantization scheme, is the decisive factor. We further find that Mixture-of-Experts (MoE) architectures defy the standard size-energy trend, offering the storage capacity of a 7B model while maintaining the lower energy profile of a 1B to 2B model. Finally, an analysis of these multi-objective trade-offs reveals a pragmatic sweet spot among mid-sized models, such as Qwen2.5-3B, that effectively balance response quality with sustainable energy consumption.
