Distributed dataflow systems such as Apache Spark or Apache Flink enable parallel, in-memory data processing on large clusters of commodity hardware. Consequently, the appropriate amount of memory to allocate to the cluster is a crucial consideration. In this paper, we analyze the challenge of efficient resource allocation for distributed data processing, focusing on memory. We emphasize that in-memory processing with these frameworks can undermine resource efficiency. Based on the findings of our trace data analysis, we compile requirements for an automated solution that allocates cluster resources efficiently.