Deploying large language models on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) across four platforms: a Raspberry Pi 5 with Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop NVIDIA RTX 4050 GPU. Using a fixed 258-token prompt over 20 warm-condition iterations per device, we measure throughput, latency, power, and thermal behaviour. For mobile platforms, thermal management supersedes peak compute as the primary constraint: the iPhone 16 Pro loses nearly half its throughput within two iterations, and the S24 Ultra suffers a hard OS-enforced GPU frequency floor that terminates inference entirely. On dedicated hardware, distinct constraints dominate: the RTX 4050 is bounded by its battery power ceiling, while the Hailo-10H is limited by on-module memory bandwidth. The RTX 4050 sustains 131.7 tok/s at 34.1 W; the Hailo-10H sustains 6.9 tok/s at under 2 W with near-zero variance, matching the RTX 4050 in energy proportionality at 19x lower throughput. Results should be interpreted as platform-level deployment characterisations for a single model and prompt type, reflecting hardware and software combined, rather than general claims about hardware capability alone.
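The energy comparison in the abstract can be checked with a short calculation. The sketch below assumes that "energy proportionality" refers to tokens generated per joule (throughput divided by power), which is an interpretation and not a definition given in the text. Because the Hailo-10H figure is reported only as "under 2 W", 2.0 W is used as an assumed upper bound, so the result for that device is a lower bound on its efficiency. The function and variable names are illustrative.

```python
def tokens_per_joule(throughput_tok_s: float, power_w: float) -> float:
    """Tokens generated per joule of energy: (tok/s) / (J/s)."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return throughput_tok_s / power_w


# Figures quoted in the abstract.
rtx_tok_s, rtx_power_w = 131.7, 34.1
hailo_tok_s = 6.9
# Assumption: the abstract says "under 2 W"; 2.0 W is used as an upper bound.
hailo_power_w_upper = 2.0

rtx_eff = tokens_per_joule(rtx_tok_s, rtx_power_w)
hailo_eff_lower = tokens_per_joule(hailo_tok_s, hailo_power_w_upper)
throughput_ratio = rtx_tok_s / hailo_tok_s

print(f"RTX 4050:   {rtx_eff:.2f} tok/J")
print(f"Hailo-10H: >= {hailo_eff_lower:.2f} tok/J (lower bound)")
print(f"Throughput ratio: {throughput_ratio:.1f}x")
```

Under these assumptions the two devices land within roughly 10% of each other in tokens per joule, while their throughput differs by about 19x, consistent with the abstract's summary.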