This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries, such as Library Genesis and Z-Library, for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics such as ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality and addresses biases in traditional corpora such as HathiTrust. The catalogue also includes secondary datasets for French, German, and Spanish. We evaluate the accuracy of the linkage strategy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.