State-of-the-art sequential recommendation models rely heavily on the Transformer's attention mechanism. However, the quadratic computational and memory complexity of self-attention limits its scalability for modeling users' long-range behavior sequences. To address this problem, we propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression, which requires only linear time complexity and decouples model capacity from computational cost. Specifically, ELASTIC introduces fixed-length interest experts with a linear dispatcher attention mechanism that compresses long-term behavior sequences into a significantly more compact representation, reducing GPU memory usage by up to 90% with a 2.7× inference speedup. The proposed linear dispatcher attention mechanism replaces the quadratic complexity with a linear one and makes it feasible to adequately model extremely long sequences. Moreover, to retain the capacity for modeling diverse user interests, ELASTIC initializes a vast learnable interest memory bank and sparsely retrieves compressed user interests from the memory with negligible computational overhead. The proposed interest memory retrieval technique significantly expands the cardinality of the available interest space at the same computational cost, thereby striking a trade-off between recommendation accuracy and efficiency. To validate the effectiveness of ELASTIC, we conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders. Experimental results demonstrate that ELASTIC consistently outperforms baselines by a significant margin and highlight its computational efficiency when modeling long sequences. We will make our implementation code publicly available.
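The core efficiency idea of dispatcher-style attention is that a fixed number k of learnable interest queries cross-attend over the length-n behavior sequence, so the score matrix is k×n rather than n×n. The following is a minimal NumPy sketch of this pattern, not the paper's implementation: the function name `dispatcher_attention` and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dispatcher_attention(seq, interest_queries):
    """Hypothetical sketch: compress a length-n sequence into k interest slots.

    seq:              (n, d) item embeddings of the behavior sequence
    interest_queries: (k, d) fixed-length learnable interest experts
    Cost is O(n * k * d) -- linear in n -- versus O(n^2 * d) for self-attention.
    """
    d = seq.shape[1]
    scores = interest_queries @ seq.T / np.sqrt(d)  # (k, n) cross-attention scores
    weights = softmax(scores, axis=-1)              # each query attends over all n items
    return weights @ seq                            # (k, d) compressed representation

rng = np.random.default_rng(0)
n, k, d = 1024, 16, 32
seq = rng.standard_normal((n, d))
queries = rng.standard_normal((k, d))
compressed = dispatcher_attention(seq, queries)
assert compressed.shape == (k, d)  # sequence of 1024 items compressed to 16 slots
```

Because k is fixed (e.g. 16) while n grows with the user's history, both compute and memory scale linearly in sequence length, which is consistent with the reported memory and speed gains.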
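The interest memory retrieval step can likewise be sketched: a large learnable bank of interest prototypes is queried sparsely, so only a handful of slots participate downstream while the addressable interest space stays large. This is a hedged NumPy illustration under assumed shapes; `retrieve_interests` and its top-m selection rule are not taken from the paper.

```python
import numpy as np

def retrieve_interests(user_state, memory_bank, top_m=4):
    """Hypothetical sketch: sparse retrieval from a large interest memory bank.

    user_state:  (d,)   compressed user representation
    memory_bank: (M, d) learnable interest prototypes, M can be very large
    Only the top_m highest-scoring slots are mixed, so the downstream cost
    is independent of M while the available interest space grows with M.
    """
    scores = memory_bank @ user_state                 # (M,) similarity to each slot
    idx = np.argpartition(-scores, top_m)[:top_m]     # indices of the top_m slots
    sel = scores[idx]
    w = np.exp(sel - sel.max())
    w /= w.sum()                                      # softmax over the selected slots only
    return w @ memory_bank[idx]                       # (d,) retrieved interest vector

rng = np.random.default_rng(0)
bank = rng.standard_normal((1024, 32))  # M = 1024 slots, d = 32
state = rng.standard_normal(32)
interest = retrieve_interests(state, bank)
assert interest.shape == (32,)
```

Enlarging `memory_bank` (growing M) expands the cardinality of the interest space without changing the per-user retrieval cost, which mirrors the accuracy-efficiency trade-off the abstract describes.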