Optimizing a Data Science System for Text Reuse Analysis

Text reuse is a methodological element of fundamental importance in humanities research: pieces of text that reappear across different documents, verbatim or paraphrased, provide invaluable information about the historical spread and evolution of ideas. Large modern digitized corpora enable the joint analysis of text collections that span entire centuries and the detection of large-scale patterns that are impossible to detect with traditional small-scale analysis. For this opportunity to materialize, it is necessary to develop efficient data science systems that perform the corresponding analysis tasks. In this paper, we share insights from ReceptionReader, a system for analyzing text reuse in large historical corpora. The system is built upon billions of instances of text reuse from large digitized corpora of 18th-century texts. Its main functionality is to perform downstream text reuse analysis tasks, such as finding reuses that stem from a given article or identifying the most reused quotes from a set of documents, with each task expressed as a database query. We discuss the related design choices, including various database normalization levels and query execution frameworks: distributed data processing (Apache Spark), an indexed row store engine (MariaDB Aria), and a compressed column store engine (MariaDB Columnstore). Moreover, we present an extensive evaluation with various metrics of interest (latency, storage size, and computing costs) for varying workloads, and we offer insights from the trade-offs we observed and the choices that emerged as optimal in our setting. In summary, our results show that (1) for the workloads that are most relevant to text reuse analysis, the MariaDB Aria framework emerges as the overall optimal choice, and (2) big data processing (Apache Spark) is irreplaceable for all processing stages of the system's pipeline.
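To make the notion of a task "expressed as a database query" concrete, the following is a minimal sketch of the two example tasks named above. It is illustrative only: the `reuse` table, its columns, and the function names are hypothetical and are not the ReceptionReader schema, and Python's built-in SQLite is used as a stand-in for the engines compared in the paper.

```python
import sqlite3

# Hypothetical, simplified schema: one row per detected reuse instance.
# This is an assumption for illustration, not the system's actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reuse (
    src_doc   TEXT,  -- document in which the passage originates
    src_quote TEXT,  -- identifier of the reused passage
    dst_doc   TEXT   -- document that reuses the passage
);
-- An index on the lookup column, in the spirit of an indexed row store.
CREATE INDEX idx_reuse_src ON reuse (src_doc);
""")
conn.executemany(
    "INSERT INTO reuse VALUES (?, ?, ?)",
    [
        ("A", "q1", "B"),
        ("A", "q1", "C"),
        ("A", "q2", "C"),
        ("D", "q3", "B"),
    ],
)


def reuses_from(conn, doc):
    """Task 1: documents that reuse text stemming from a given document."""
    rows = conn.execute(
        "SELECT DISTINCT dst_doc FROM reuse WHERE src_doc = ? ORDER BY dst_doc",
        (doc,),
    )
    return [row[0] for row in rows]


def most_reused_quotes(conn, docs, limit=10):
    """Task 2: the most reused passages from a set of documents."""
    placeholders = ",".join("?" * len(docs))
    sql = (
        "SELECT src_quote, COUNT(*) AS n FROM reuse "
        f"WHERE src_doc IN ({placeholders}) "
        "GROUP BY src_quote ORDER BY n DESC, src_quote LIMIT ?"
    )
    return conn.execute(sql, (*docs, limit)).fetchall()


print(reuses_from(conn, "A"))
print(most_reused_quotes(conn, ["A", "D"], limit=2))
```

The first task is a selective lookup that an index can serve directly, while the second is an aggregation over many rows. Under these assumptions, the two tasks stress a storage engine in different ways, which is the kind of workload difference the evaluation examines.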
