Performance modeling for large-scale data analytics workloads can improve the efficiency of cluster resource allocation and job scheduling. However, the performance of these workloads is influenced by numerous factors, such as job inputs and the assigned cluster resources. As a result, performance models require significant amounts of training data. This data can be obtained by exchanging runtime metrics between collaborating organizations, yet not all organizations may be willing to publicly disclose such metadata. We present a privacy-preserving approach for sharing runtime metrics based on differential privacy and data synthesis. Our evaluation on performance data from 736 Spark job executions indicates that fully anonymized training data largely maintains performance prediction accuracy, particularly when little original data is available. With 30 or fewer original data samples available, using synthetic training data reduced performance model accuracy by only one percent on average.