Large Language Models (LLMs) have achieved remarkable success across various domains, yet deploying them on mobile devices remains challenging because of their heavy computational and memory demands. Lightweight LLMs have been developed to fit mobile environments, but they suffer from degraded accuracy. In contrast, sparsity-based techniques minimize DRAM usage by retaining the full model in external storage, such as flash, and transferring only the relevant neurons to DRAM. However, such approaches are critically limited by the large number of I/O operations they issue, particularly on smartphones with severe IOPS constraints. In this paper, we propose Ripple, a novel approach that accelerates LLM inference on smartphones by optimizing neuron placement in flash memory. Ripple leverages the concept of Neuron Co-Activation: neurons that are frequently activated together are linked to facilitate continuous read access and improve data transfer efficiency. Our approach is a two-stage solution: an offline stage that reorganizes neuron placement based on co-activation patterns, and an online stage that employs data access and caching strategies tailored to the hardware characteristics. Evaluations on a variety of smartphones and LLMs demonstrate that Ripple improves I/O latency by up to 5.93x over the state of the art. As the first solution to optimize storage placement under sparsity, Ripple explores a new optimization space at the intersection of sparsity-driven algorithms and storage-level system co-design in LLM inference.
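To make the co-activation idea concrete, the following is a minimal sketch of how placement by co-activation can reduce the number of read operations. It is not Ripple's actual algorithm: the greedy chaining heuristic, the toy activation traces, and the names `coactivation_counts`, `greedy_placement`, `read_ops`, and `TRACES` are all assumptions made for illustration. The sketch also assumes that each contiguous run of neurons in flash costs exactly one read.

```python
from collections import Counter
from itertools import combinations

# Hypothetical activation traces: each set lists the neurons active for one token.
TRACES = [{0, 4, 6}, {0, 4, 6}, {1, 3, 7}, {1, 3, 7}, {2, 5}, {0, 2, 4, 6}]


def coactivation_counts(traces):
    """Count how often each unordered neuron pair is active in the same trace."""
    counts = Counter()
    for active in traces:
        for a, b in combinations(sorted(active), 2):
            counts[(a, b)] += 1
    return counts


def greedy_placement(num_neurons, traces):
    """Order neurons so that frequently co-activated ones become neighbours.

    Illustrative heuristic only: start from the most frequently activated
    neuron, then repeatedly append the unplaced neuron that is most often
    co-activated with the last placed one (ties go to the lowest id).
    """
    counts = coactivation_counts(traces)
    freq = Counter(n for active in traces for n in active)

    def weight(a, b):
        return counts[(min(a, b), max(a, b))]

    unplaced = set(range(num_neurons))
    current = max(sorted(unplaced), key=lambda n: freq[n])
    order = [current]
    unplaced.remove(current)
    while unplaced:
        current = max(sorted(unplaced), key=lambda n: (weight(current, n), freq[n]))
        order.append(current)
        unplaced.remove(current)
    return order


def read_ops(order, active):
    """Number of contiguous runs (assumed: one read each) needed to fetch `active`."""
    position = {n: i for i, n in enumerate(order)}
    slots = sorted(position[n] for n in active)
    return sum(1 for i, s in enumerate(slots) if i == 0 or s != slots[i - 1] + 1)


if __name__ == "__main__":
    baseline = list(range(8))            # neurons stored in their original order
    placed = greedy_placement(8, TRACES)
    print("placement:", placed)
    print("reads (original order):", sum(read_ops(baseline, t) for t in TRACES))
    print("reads (co-activation order):", sum(read_ops(placed, t) for t in TRACES))
```

On these toy traces the original order needs 18 reads in total, while the co-activation order needs 6, because every activated set then occupies one contiguous run. Real activation patterns are far less regular, so the gain from such a simple heuristic would be smaller and would depend on the workload.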