Deploying large language models (LLMs) on smartphones poses significant engineering challenges because of stringent constraints on memory, latency, and runtime flexibility. In this work, we present a hardware-aware framework for efficient on-device inference of a LLaMA-based multilingual foundation model that supports multiple use cases on Samsung Galaxy S24 and S25 devices, which use the Qualcomm SM8650 and SM8750 chipsets, respectively. Our approach passes application-specific LoRA adapters as runtime inputs to a single frozen inference graph, enabling dynamic task switching without recompilation or additional memory overhead. We further introduce a multi-stream decoding mechanism that concurrently generates stylistic variations (such as formal, polite, or jovial responses) within a single forward pass, reducing latency by up to 6x. To accelerate token generation, we apply Dynamic Self-Speculative Decoding (DS2D), a tree-based strategy that predicts future tokens without requiring a draft model, yielding up to a 2.3x speedup in decode time. Combined with INT4 quantization and architecture-level optimizations, our system achieves 4-6x overall improvements in memory and latency while maintaining accuracy across 9 languages and 8 tasks. These results demonstrate the practical feasibility of deploying multi-use-case LLMs on edge devices, advancing the commercial viability of generative AI on mobile platforms.
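The idea of supplying LoRA weights as runtime inputs to one frozen graph can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`frozen_linear`, `run`), the toy weights, and the task names in `ADAPTERS` are all hypothetical, and a single linear layer in plain Python stands in for the compiled on-device graph. The point it shows is that the computation stays fixed while only the adapter arguments change per task.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def frozen_linear(x: Vector, base_w: Matrix, lora_a: Matrix,
                  lora_b: Matrix, scale: float = 1.0) -> Vector:
    """One 'frozen' layer: y = W x + scale * B (A x).

    base_w is fixed; lora_a and lora_b arrive as ordinary inputs,
    so a different task needs different arguments, not a new function.
    """
    base = matvec(base_w, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base, delta)]


# Hypothetical toy weights: a 2x2 base matrix and two rank-1 adapters.
BASE_W: Matrix = [[1.0, 0.0], [0.0, 1.0]]
ADAPTERS: Dict[str, Tuple[Matrix, Matrix]] = {
    "summarize": ([[1.0, 1.0]], [[0.5], [0.0]]),
    "translate": ([[1.0, -1.0]], [[0.0], [2.0]]),
}


def run(task: str, x: Vector) -> Vector:
    """Select the adapter for a task and feed it to the same frozen layer."""
    lora_a, lora_b = ADAPTERS[task]
    return frozen_linear(x, BASE_W, lora_a, lora_b)


if __name__ == "__main__":
    for task in ADAPTERS:
        print(task, run(task, [2.0, 3.0]))
```

In this sketch, switching from one task to another is only a dictionary lookup followed by a call to the same function, which mirrors the claim that no recompilation is needed; how the actual system lays out adapter tensors in memory is not specified here.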