Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. The main obstacles are high token-generation costs, large KV-cache footprints, and inefficiencies in distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often distill reasoning traces from larger models into smaller ones, but these traces are verbose and stylistically redundant, which is undesirable for on-device inference. In this work, we propose a lightweight approach that enables reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at the cost of a minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.
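As a rough illustration of the parallel test-time scaling idea mentioned in the abstract, the sketch below samples several candidate answers for one prompt and aggregates them. The abstract does not say how candidates are combined; majority voting is assumed here, and `sample_fn`, `majority_vote`, and `parallel_scale` are hypothetical names introduced only for this example.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    # Scan in original order so that ties are broken deterministically.
    for answer in answers:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")


def parallel_scale(sample_fn: Callable[[str], str], prompt: str, n: int) -> str:
    """Draw n candidate answers for one prompt and aggregate them by vote.

    sample_fn is a hypothetical stand-in for one stochastic decode of the
    model that returns only the final answer string. This sketch calls it
    in a plain loop; a memory-bound decoder would instead generate the n
    candidates as one batch so each weight read is shared across them.
    """
    answers = [sample_fn(prompt) for _ in range(n)]
    return majority_vote(answers)
```

Under this assumption, the extra candidates mostly add compute rather than additional weight traffic when decoded as a batch, which is one plausible reading of why the abstract describes the latency increase as minor.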