Towards Personalized Bangla Book Recommendation: A Large-Scale Heterogeneous Book Graph Dataset

From arXiv. Comments: Added new experimental results on sequential recommendation; top-N recommendation results have been updated to use a per-user temporal leave-last-one-out split instead of a random split.

Personalized book recommendation in Bangla literature has been constrained by the lack of structured, large-scale, and publicly available datasets. This work introduces RokomariBG, a large-scale heterogeneous book graph dataset designed to support research on personalized recommendation in a low-resource language setting. The dataset comprises 127,302 books, 63,723 users, 16,601 authors, 1,515 categories, 2,757 publishers, and 209,602 reviews, connected through several relation types and organized as a comprehensive knowledge graph. To demonstrate the utility of the dataset, we present a systematic benchmarking study on the top-N recommendation and sequential recommendation tasks, evaluating a diverse set of representative recommendation models. Through comprehensive benchmarking, we demonstrate that recommendation performance in this domain is strongly influenced by both heterogeneous relational information and code-mixed textual metadata. These findings reveal unique challenges of Bangladeshi e-commerce ecosystems that are largely absent from existing recommendation benchmarks. Overall, this work establishes a foundational benchmark and a publicly available resource for Bangla book recommendation research, enabling reproducible evaluation and future studies on recommendation in low-resource cultural domains. The dataset and code are publicly available at https://github.com/backlashblitz/Bangla-Book-Recommendation-Dataset
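
The update note for this version says the top-N results use a per-user temporal leave-last-one-out split. The sketch below illustrates that general protocol; it is not the authors' code. The function name `leave_last_one_out`, the `(user, item, timestamp)` tuple layout, and the toy data are assumptions made for illustration, and the handling of single-interaction users is one possible choice among several.

```python
from collections import defaultdict


def leave_last_one_out(interactions):
    """Split (user, item, timestamp) tuples into train and test sets.

    For each user, the most recent interaction is held out for testing
    and all earlier interactions are kept for training.
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, test = [], {}
    for user, events in by_user.items():
        events.sort()  # chronological order; timestamp ties fall back to item id
        if len(events) < 2:
            # Assumption: a user with a single interaction has no history
            # to learn from, so the interaction stays in the training set.
            train.extend((user, item, ts) for ts, item in events)
            continue
        *history, (_, last_item) = events
        train.extend((user, item, ts) for ts, item in history)
        test[user] = last_item
    return train, test


# Hypothetical toy data: (user_id, book_id, unix_timestamp)
toy = [
    ("u1", "b1", 10), ("u1", "b2", 30), ("u1", "b3", 20),
    ("u2", "b1", 5), ("u2", "b4", 7),
    ("u3", "b9", 1),
]
train, test = leave_last_one_out(toy)
```

Unlike a random split, this keeps every held-out interaction later in time than the same user's training interactions, so a model is never evaluated on an event that precedes what it was trained on for that user.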

