Large language models (LLMs) are becoming increasingly capable at small parameter scales. At the same time, conventional cloud-centric deployment introduces challenges around data privacy, latency, and cost that are particularly acute in operational technology and defence environments. Advances in model distillation, quantisation, and affordable edge accelerators now make local LLM inference on single-board computers feasible, but the high dimensionality of the configuration space makes it difficult to identify optimal deployments without structured evaluation. Existing LLM-specific edge benchmarking efforts rely on CPU-only inference, offer poor coverage of genuine single-board computers, and use generic evaluation tasks that lack multi-dimensional assessment of hardware effectiveness. This paper proposes a multi-dimensional benchmarking methodology that jointly evaluates inference performance and hardware efficiency across four IoT-suitable edge platform configurations, testing single-board computers with the latest available hardware accelerators. Our results reveal the benefits of hardware accelerators such as NPUs and GPUs, and our multi-dimensional evaluations quantify the trade-offs among power efficiency, physical device size, and token throughput, offering practical guidance for deploying generative AI in privacy-sensitive and connectivity-limited environments such as unmanned vehicles and portable, ruggedised operations.