Binge Watch: Reproducible Multimodal Benchmarks Datasets for Large-Scale Movie Recommendation on MovieLens-10M and 20M

With the growing interest in Multimodal Recommender Systems (MRSs), collecting high-quality datasets with multimedia side information (text, images, audio, video) has become a fundamental step. However, most of the current literature in the field relies on small- or medium-scale datasets that are either not publicly released or built through undocumented processes. In this paper, we aim to fill this gap by releasing M3L-10M and M3L-20M, two large-scale, reproducible, multimodal datasets for the movie domain, obtained by enriching the popular MovieLens-10M and MovieLens-20M, respectively, with multimodal features. Following a fully documented pipeline, we collect movie plots, posters, and trailers, from which we extract textual, visual, acoustic, and video features using several state-of-the-art encoders. We publicly release mappings to download the original raw data, the extracted features, and the complete datasets in multiple formats, fostering reproducibility and advancing the field of MRSs. In addition, we conduct qualitative and quantitative analyses that characterize our datasets from several perspectives. This work represents a foundational step toward ensuring reproducibility and replicability in the large-scale, multimodal movie recommendation domain. The resource is available at https://zenodo.org/records/18499145, and the source code at https://github.com/giuspillo/M3L_10M_20M.
