A Content-Driven Micro-Video Recommendation Dataset at Scale

Micro-videos have recently gained immense popularity, sparking critical research in micro-video recommendation with significant implications for the entertainment, advertising, and e-commerce industries. However, the lack of large-scale public micro-video datasets poses a major challenge for developing effective recommender systems. To address this challenge, we introduce a very large micro-video recommendation dataset, named "MicroLens", consisting of one billion user-item interaction behaviors, 34 million users, and one million micro-videos. This dataset also contains various raw modality information about videos, including titles, cover images, audio, and full-length videos. MicroLens serves as a benchmark for content-driven micro-video recommendation, enabling researchers to utilize various modalities of video information for recommendation, rather than relying solely on item IDs or off-the-shelf video features extracted from a pre-trained network. Our benchmarking of multiple recommender models and video encoders on MicroLens has yielded valuable insights into the performance of micro-video recommendation. We believe that this dataset will not only benefit the recommender system community but also promote the development of the video understanding field. Our datasets and code are available at https://github.com/westlake-repl/MicroLens.

