CAMEO: A Causal Transfer Learning Approach for Performance Optimization of Configurable Computer Systems

Modern computer systems are highly configurable, with hundreds of interacting configuration options that produce an enormous configuration space. As a result, optimizing performance goals (e.g., latency) in such systems is challenging. Worse, owing to evolving application requirements and user specifications, these systems frequently face uncertainty in their environments (e.g., hardware and workload changes), making performance optimization even more challenging. Recently, transfer learning has been applied to address this problem by reusing knowledge from offline configuration measurements taken in an old environment (the source) in a new environment (the target). These approaches typically rely on predictive machine learning (ML) models to guide the search for interventions that optimize performance. However, previous empirical research has shown that statistical models might perform poorly when the deployment environment changes because the independent and identically distributed (i.i.d.) assumption no longer holds. To address this issue, we propose Cameo, a method that sidesteps these limitations by identifying invariant causal predictors under environmental changes, enabling the optimization process to operate on a reduced search space and leading to faster system performance optimization. We demonstrate significant performance improvements over state-of-the-art optimization methods on five highly configurable computer systems: three MLPerf deep learning benchmark systems, a video analytics pipeline, and a database system. We also study the method's effectiveness in design explorations with different varieties and severities of environmental change, and show that our approach scales to colossal configuration spaces.
