Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware--software co-design inference framework built on one key insight: an LMM can be broken into modular ``bricks'' (vision, language, audio, etc.), and each brick can be scheduled on the compute unit best suited to it. NANOMIND performs this module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate the framework on a compact, battery-powered device that runs LMMs entirely on device. The prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and better power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Compared with existing implementations, our system cuts energy consumption by 42.3\% and GPU memory usage by 11.2\%, which enables a battery-powered device to run LLaVA-OneVision with a camera for nearly 20.8 hours.
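
The module-to-accelerator mapping described above can be illustrated with a minimal sketch. It is not NANOMIND's scheduler: the `Brick` type, the `assign_bricks` function, the accelerator names, and every latency figure are hypothetical and chosen only for illustration. The sketch assumes each brick carries a per-accelerator latency estimate and greedily picks the fastest accelerator that is currently available.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Brick:
    """One model component (hypothetical type, not from the paper)."""
    name: str
    # Accelerator name -> estimated latency in ms. Illustrative values only,
    # not measurements reported by the authors.
    est_latency_ms: Dict[str, float]


def assign_bricks(bricks: Iterable[Brick], available: Set[str]) -> Dict[str, str]:
    """Map each brick to the available accelerator with the lowest estimate."""
    plan = {}
    for brick in bricks:
        # Keep only accelerators that support this brick and are available.
        candidates = {
            acc: ms for acc, ms in brick.est_latency_ms.items() if acc in available
        }
        if not candidates:
            raise ValueError(f"no available accelerator can run {brick.name}")
        plan[brick.name] = min(candidates, key=candidates.get)
    return plan


# Assumed example workload: three bricks of a vision-language model.
BRICKS = [
    Brick("vision_encoder", {"npu": 40.0, "gpu": 65.0, "cpu": 400.0}),
    Brick("projector", {"gpu": 2.0, "cpu": 6.0}),
    Brick("language_model", {"gpu": 30.0, "cpu": 220.0}),
]

# All accelerators free: the vision encoder lands on the NPU.
plan = assign_bricks(BRICKS, available={"npu", "gpu", "cpu"})

# NPU busy or absent: the vision encoder falls back to the GPU.
fallback = assign_bricks(BRICKS, available={"gpu", "cpu"})
```

Re-running the assignment with a different `available` set is the simplest stand-in for dynamic offloading. A real scheduler would also need to account for memory placement and transfer cost, which this sketch ignores.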
