Large Language Models (LLMs) have achieved remarkable success across various domains, yet deploying them on mobile devices remains difficult because of their heavy computational and memory demands. Lightweight LLMs have been developed to fit mobile environments, but they sacrifice model accuracy. In contrast, sparsity-based techniques minimize DRAM usage by transferring only the relevant neurons to DRAM while retaining the full model in external storage, such as flash. However, such approaches are bottlenecked by the many I/O operations they issue, particularly on smartphones with severe IOPS constraints. In this paper, we propose Ripple, a novel approach that accelerates LLM inference on smartphones by optimizing neuron placement in flash memory. Ripple leverages the concept of Neuron Co-Activation: neurons that are frequently activated together are linked so that they can be read with continuous accesses, which improves data transfer efficiency. Our approach has two stages: an offline stage that reorganizes neuron placement based on co-activation patterns, and an online stage that employs tailored data access and caching strategies to align with hardware characteristics. Evaluations conducted on a variety of smartphones and LLMs demonstrate that Ripple improves I/O latency by up to 5.93x compared to the state of the art. As the first solution to optimize storage placement under sparsity, Ripple explores a new optimization space at the intersection of sparsity-driven algorithms and storage-level system co-design in LLM inference.
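The idea of placing co-activated neurons next to each other can be illustrated with a small sketch. This is not Ripple's actual algorithm, which the abstract does not specify; it is a hypothetical greedy heuristic, and the function names (coactivation_counts, greedy_placement, read_ops) and the toy activation traces are assumptions made for illustration. It counts how often pairs of neurons fire together, chains the most strongly co-activated pairs into adjacent flash slots, and then counts how many contiguous reads each activation pattern would need under the default and the reorganized layout.

```python
from collections import Counter
from itertools import combinations


def coactivation_counts(traces):
    """Count how often each unordered neuron pair is active in the same trace."""
    counts = Counter()
    for active in traces:
        for a, b in combinations(sorted(active), 2):
            counts[(a, b)] += 1
    return counts


def greedy_placement(num_neurons, counts):
    """Join neuron chains end to end, strongest co-activation first.

    Hypothetical heuristic for illustration only; not the paper's method.
    """
    chain_of = {n: [n] for n in range(num_neurons)}
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (a, b), _ in ranked:
        ca, cb = chain_of[a], chain_of[b]
        if ca is cb:
            continue  # already in the same chain
        # A chain can only grow at its two ends.
        if a not in (ca[0], ca[-1]) or b not in (cb[0], cb[-1]):
            continue
        if ca[-1] != a:
            ca.reverse()  # make a the tail of its chain
        if cb[0] != b:
            cb.reverse()  # make b the head of its chain
        ca.extend(cb)
        for n in cb:
            chain_of[n] = ca
    # Concatenate the chains into one linear storage order.
    layout, placed = [], set()
    for n in range(num_neurons):
        if n not in placed:
            layout.extend(chain_of[n])
            placed.update(chain_of[n])
    return layout


def read_ops(layout, active):
    """Number of contiguous runs needed to read the active neurons."""
    pos = {n: i for i, n in enumerate(layout)}
    slots = sorted(pos[n] for n in active)
    return sum(1 for i, s in enumerate(slots) if i == 0 or s != slots[i - 1] + 1)


# Toy activation traces over 6 neurons (made-up data).
traces = [{0, 3, 5}, {0, 3, 5}, {1, 4}, {1, 4, 2}]
default_layout = list(range(6))
layout = greedy_placement(6, coactivation_counts(traces))

before = sum(read_ops(default_layout, t) for t in traces)
after = sum(read_ops(layout, t) for t in traces)
print(layout, before, after)
```

In this toy setting the reorganized layout needs fewer separate reads because neurons that fire together sit in adjacent slots. A real system would additionally have to account for read granularity, caching, and the cost of profiling activations, none of which this sketch models.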