Machine learning-based performance models are increasingly used to make critical job scheduling and application optimization decisions. Traditionally, these models assume that the data distribution does not change as more samples are collected over time. However, owing to their complexity and heterogeneity, production HPC systems are susceptible to hardware degradation, replacement, and/or software patches, which can cause the data distribution to drift and adversely affect the performance models. To this end, we develop continually learning performance models that account for distribution drift, alleviate catastrophic forgetting, and improve generalizability. Our best model retained its accuracy despite having to learn the new data distribution induced by system changes, while demonstrating a 2x improvement in prediction accuracy over the whole data sequence compared with the naive approach.