Although deep pre-trained language models have shown promising benefits in a wide range of industrial scenarios, including Click-Through-Rate (CTR) prediction, integrating language models that handle only textual signals into a prediction pipeline with non-textual features remains challenging. To date, two directions have been explored for integrating multi-modal inputs when fine-tuning pre-trained language models. The first fuses the output of the language model with non-textual features through an aggregation layer, resulting in an ensemble framework in which the cross-information between textual and non-textual inputs is learned only in the aggregation layer. The second splits non-textual features into fine-grained fragments and transforms the fragments into new tokens that are combined with the textual ones, so that they can be fed directly to the transformer layers of the language model. However, this approach increases the complexity of learning and inference because of the numerous additional tokens. To address these limitations, we propose a novel framework, BERT4CTR, with a Uni-Attention mechanism that benefits from the interactions between non-textual and textual features while keeping training and inference time costs low through dimensionality reduction. Comprehensive experiments on both public and commercial data demonstrate that BERT4CTR significantly outperforms state-of-the-art frameworks for handling multi-modal inputs and is applicable to CTR prediction.
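To give a rough sense of why turning non-textual features into extra tokens is costly, the sketch below counts the multiply-accumulate operations needed to form the attention score matrix in one standard self-attention layer, a cost that grows quadratically with sequence length. The function name, token counts, and hidden size are illustrative assumptions and are not taken from this work; the sketch does not model the Uni-Attention mechanism.

```python
def attention_score_macs(seq_len: int, hidden_size: int) -> int:
    """Multiply-accumulates to form the seq_len x seq_len score matrix QK^T
    in one standard self-attention layer, summed over all heads."""
    return seq_len * seq_len * hidden_size


# Hypothetical sizes, chosen only for illustration.
TEXT_TOKENS = 64
EXTRA_FEATURE_TOKENS = 64  # tokens created from non-textual feature fragments
HIDDEN_SIZE = 768

text_only = attention_score_macs(TEXT_TOKENS, HIDDEN_SIZE)
with_feature_tokens = attention_score_macs(
    TEXT_TOKENS + EXTRA_FEATURE_TOKENS, HIDDEN_SIZE
)

# Doubling the sequence length quadruples the score-matrix cost.
ratio = with_feature_tokens / text_only
print(ratio)  # 4.0
```

Under these assumed sizes, doubling the number of tokens quadruples this part of the attention cost, which is the kind of overhead the token-based direction incurs.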