In today's distributed computing environments, large amounts of data are generated from different sources at high velocity, making the data difficult to capture, manage, and process with existing relational databases. Hadoop is a framework for storing and processing large datasets in parallel across a cluster of machines. It offers flexibility, scalability, and high fault tolerance; however, it faces challenges in data access time, I/O operations, and duplicate computations, which result in extra overhead, wasted resources, and poor performance. Many researchers have used caching mechanisms to tackle these challenges, presenting approaches that improve data access time, raise the data locality rate, eliminate repeated computations, reduce the number of I/O operations, shorten job execution time, and increase resource efficiency. In this study, we provide a comprehensive overview of caching strategies for improving Hadoop performance. We also introduce a novel classification based on cache utilization. Using this classification, we analyze the impact of each group on Hadoop performance and discuss its advantages and disadvantages. Finally, we present a novel hybrid approach called Hybrid Intelligent Cache (HIC), which combines the benefits of two methods from different groups, H-SVM-LRU and CLQLMRS. Experimental results show that our hybrid method achieves an average improvement of 31.2% in job execution time.