Sequential recommendation (SR) models predict a user's next interaction by modeling their historical behaviors. Transformer-based SR methods, notably BERT4Rec, effectively capture these patterns but incur significant computational overhead due to the extensive intermediate computations of Softmax-based attention. We propose Cotten4Rec, a novel SR model utilizing linear-time cosine similarity attention, implemented as a single optimized Compute Unified Device Architecture (CUDA) kernel. By minimizing intermediate buffers and kernel-launch overhead, Cotten4Rec substantially reduces resource usage compared to BERT4Rec and the linear-attention baseline LinRec, especially on datasets with moderate sequence lengths and vocabulary sizes. Evaluations across three benchmark datasets confirm that Cotten4Rec achieves considerable reductions in memory and runtime with minimal compromise in recommendation accuracy, demonstrating its viability as an efficient alternative for practical, large-scale sequential recommendation scenarios where computational resources are critical.
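The linear-time property can be illustrated with a minimal NumPy sketch. This is an assumption about the general mechanism, not the paper's actual implementation: cosine-similarity attention typically L2-normalizes queries and keys so their dot product is a cosine similarity, and because no row-wise Softmax is applied, matrix associativity lets the key-value product be computed first, reducing cost from O(n²d) to O(nd²). The function name and shapes below are illustrative; Cotten4Rec fuses this computation into a single CUDA kernel rather than separate matrix multiplies.

```python
import numpy as np

def cosine_linear_attention(Q, K, V, eps=1e-6):
    """Illustrative cosine-similarity attention in linear time.

    Q, K, V: arrays of shape (n, d) for sequence length n, head dim d.
    """
    # L2-normalize query and key rows so Qn @ Kn.T gives cosine similarities.
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    # Without Softmax, (Qn @ Kn.T) @ V == Qn @ (Kn.T @ V) by associativity.
    # The right-hand grouping forms a (d, d) intermediate instead of the
    # (n, n) attention matrix, so cost scales linearly in sequence length n.
    return Qn @ (Kn.T @ V)
```

In a fused CUDA kernel, the normalization and the two small matrix products can share one launch and avoid materializing intermediate buffers, which is the source of the memory and runtime savings the abstract describes.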