On-device recommendation is critical for many real-world applications, especially in scenarios with strict requirements on execution latency, user privacy, and robust operation when internet connectivity is unstable or unavailable. While large language models (LLMs) now offer exceptional capabilities for modeling user behavior in sequential recommendation tasks, their substantial memory footprint and computational overhead make deployment on resource-constrained devices a high-risk proposition. In this paper, we propose OD-LLM, the first task-adaptive compression framework explicitly designed for efficient and accurate on-device deployment of LLMs for sequential recommendation. OD-LLM integrates two complementary compression strategies: a low-rank structural compression algorithm that uses Singular Value Decomposition (SVD) to reduce parameter redundancy in the model, and a novel tokenization normalization technique that complements the low-rank decomposition. In addition, to minimize performance degradation at higher compression ratios, a novel progressive alignment algorithm iteratively refines the target model's parameters layer by layer. Empirical evaluations on sequential recommendation benchmarks show that OD-LLM matches the effectiveness of the original recommendation model even when the deployed model size is halved. These results demonstrate the efficacy and scalability of OD-LLM, making it a practical, real-time, on-device alternative to expensive, remotely executed LLMs.
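To make the low-rank structural compression idea concrete, the sketch below shows the standard SVD-based factorization that such methods build on: a dense weight matrix is replaced by two thin factors obtained from its truncated SVD, shrinking the parameter count. This is an illustrative example only, not the OD-LLM implementation; the layer dimensions and rank here are hypothetical.

```python
import numpy as np

# Illustrative SVD-based low-rank compression of a single weight matrix.
# Shapes and rank are hypothetical, chosen only to show the parameter savings.
rng = np.random.default_rng(0)
d_out, d_in, rank = 512, 512, 64

W = rng.standard_normal((d_out, d_in))  # stand-in for a pretrained weight

# Truncated SVD: keep only the top-`rank` singular components of W.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]  # (d_out, rank), singular values folded into U
B = Vt[:rank, :]            # (rank, d_in)

# The dense layer y = W @ x becomes two thinner layers: y ≈ A @ (B @ x).
x = rng.standard_normal(d_in)
y_full = W @ x
y_low = A @ (B @ x)

params_full = W.size              # 512 * 512 = 262144
params_low = A.size + B.size      # 512*64 + 64*512 = 65536 (4x fewer)
rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)
print(params_full, params_low, rel_err)
```

At a compression ratio of 4x per layer, the approximation error depends on how quickly the layer's singular values decay; in practice, frameworks like OD-LLM pair the decomposition with further alignment or fine-tuning to recover any lost accuracy.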