Pre-trained models learn general representations from large datasets and can be fine-tuned for specific tasks, significantly reducing training time. Models such as generative pre-trained transformers (GPT), bidirectional encoder representations from transformers (BERT), and vision transformers (ViT) have become a cornerstone of current machine learning research. This study proposes a multi-modal movie recommendation system that extracts features from each movie's carefully designed poster and its narrative text description. The system uses a BERT model to extract text-modality information, a ViT model to extract poster/image-modality information, and a Transformer architecture to fuse the features of all modalities and predict users' preferences. Integrating pre-trained foundation models with smaller downstream datasets captures multi-modal content features more comprehensively, thereby providing more accurate recommendations. The efficacy of the proof-of-concept model is verified on the standard MovieLens 100K and 1M benchmark datasets. The prediction accuracy of user ratings improves over the baseline algorithm, demonstrating the potential of this cross-modal approach for movie and video recommendation.
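The fusion step described above can be illustrated with a minimal sketch. This is not the paper's implementation: the 768-dimensional vectors below are random placeholders standing in for pre-extracted BERT text embeddings and ViT poster embeddings, the user vector is an assumed learned embedding, and the fusion is a toy single-head self-attention layer (no learned projections) rather than a full Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings: in the paper, BERT encodes the plot text and ViT
# encodes the poster; random vectors stand in for those features here.
text_emb  = rng.standard_normal(768)   # hypothetical BERT text embedding
image_emb = rng.standard_normal(768)   # hypothetical ViT poster embedding
user_emb  = rng.standard_normal(768)   # assumed learned user embedding

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse(tokens):
    """Toy single-head self-attention over modality tokens (fusion step)."""
    x = np.stack(tokens)                           # (3, 768) token matrix
    attn = softmax(x @ x.T / np.sqrt(x.shape[1]))  # (3, 3) attention weights
    return (attn @ x).mean(axis=0)                 # pooled fused representation

fused = fuse([user_emb, text_emb, image_emb])
print(fused.shape)  # (768,)
```

A rating-prediction head (e.g. a small MLP mapping the fused vector to a scalar score) would then be trained against observed MovieLens ratings; that head is omitted here for brevity.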