We study user history modeling via Transformer encoders in deep learning recommendation models (DLRM). Such architectures can significantly improve recommendation quality, but they usually incur a high latency cost, necessitating either infrastructure upgrades or very small Transformer models. An important part of user history modeling is early fusion of the candidate item, and several methods have been studied. We revisit early fusion and compare concatenating the candidate to each history item against appending it to the end of the list as a separate item. The latter method allows us to reformulate the recently proposed amortized history inference algorithm M-FALCON \cite{zhai2024actions} for DLRM models. Experimental results show that appending with cross-attention performs on par with concatenation and that amortization significantly reduces inference costs. We conclude with results from deploying this model on the LinkedIn Feed and Ads surfaces, where amortization reduces latency by 30\% compared to non-amortized inference.