With the explosive growth of big data, workloads are becoming more complex and computationally demanding. These workloads are processed on distributed, interconnected resources that keep growing in scale and computational capacity. Data-intensive applications may exhibit different degrees of parallelism and must exploit data locality effectively. Furthermore, they may impose several Quality of Service requirements, such as time constraints and resilience against failures, as well as other objectives, such as energy efficiency. These workload features, together with the inherent characteristics of the computing resources required to process them, pose major challenges that call for effective scheduling techniques. In this chapter, we propose a classification of data-intensive workloads and give an overview of the most commonly used approaches for scheduling them in large-scale distributed systems. We present novel strategies that have been proposed in the literature and shed light on open challenges and future directions.