In the era of big data, conventional RDBMS models have become impractical for handling very large workloads. Consequently, NoSQL databases have emerged as the preferred storage solutions for processing-intensive Online Analytical Processing (OLAP) tasks. NoSQL databases fall into several classes according to their data storage mechanisms, which makes it challenging to select the most suitable one for a given OLAP workload. While each NoSQL database has distinct advantages, they share benefits widely recognized as crucial for managing OLAP workloads effectively: inherent scalability, adaptability to diverse data formats, and high data availability. Existing research predominantly evaluates individual databases within custom data pipeline setups and lacks a standardized approach for comparing different databases to identify the optimal data pipeline for OLAP workloads. In this paper, we present our experimental insights into how various NoSQL databases handle OLAP workloads within a standardized data processing pipeline. Our experimental pipeline comprises Apache Spark for large-scale transformations, data cleansing, and schema normalization; diverse NoSQL databases as data stores; and a Business Intelligence tool for data analysis and visualization.