Duet instrumentation: An Agentic Approach to Improving Sensitivity in Cloud Service Benchmarking

Continuous performance benchmarking of cloud services is essential for detecting performance bugs early, before they are deployed to production. However, detecting performance regressions with application benchmarks, which usually treat the system under test as a black box, is challenging because of variable I/O calls and the changing performance characteristics of the underlying cloud infrastructure. Microbenchmarks are often more sensitive and accurate, but they are also more time-consuming to implement and run, and they do not capture the performance of the integrated system as a whole. A comprehensive performance assessment therefore typically requires a combination of both approaches. To address the shortcomings of application benchmarks, we propose duet instrumentation, a novel benchmarking paradigm enabled by recent advances in large language model (LLM) code understanding. The idea is to analyze the code changes between two consecutive application versions and, during a synchronized benchmark of both versions, measure performance differences directly at the performance-relevant changes, which uncovers performance changes with higher sensitivity. We design a system that reliably automates the assessment and instrumentation of performance-relevant code changes between the two application versions. In experiments with a realistic testbed application offering configurable performance regressions, our prototype achieves 58% precision, 93% recall, and 71% specificity (averaged across tasks) when the generated instrumentation is compared against the ideal instrumentation with a line-distance threshold of five lines. In the downstream application benchmark, our prototype detects performance regressions at up to 5x lower injected severity than a traditional duet application benchmark while preserving similar A/A latency distributions.
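
To make the reported metrics concrete, the following is a minimal sketch of how precision, recall, and specificity could be computed with a line-distance threshold of five. The abstract does not specify the matching procedure, so everything here is an assumption for illustration: the function names (`near`, `confusion`, `scores`), the idea of classifying each changed code site as positive when a probe lies within the threshold, and the example line numbers are all hypothetical and are not taken from the prototype.

```python
LINE_DISTANCE_THRESHOLD = 5  # threshold named in the abstract


def near(line, probes, threshold=LINE_DISTANCE_THRESHOLD):
    """Return True if any probe lies within `threshold` lines of `line`."""
    return any(abs(line - probe) <= threshold for probe in probes)


def confusion(candidates, ideal, generated, threshold=LINE_DISTANCE_THRESHOLD):
    """Count (tp, fp, fn, tn) over candidate change sites.

    Assumption (hypothetical, not from the paper): a candidate site counts as
    actually relevant if an ideal probe is within the threshold, and as
    predicted relevant if a generated probe is within the threshold.
    """
    tp = fp = fn = tn = 0
    for line in candidates:
        actual = near(line, ideal, threshold)
        predicted = near(line, generated, threshold)
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def scores(candidates, ideal, generated, threshold=LINE_DISTANCE_THRESHOLD):
    """Return precision, recall, and specificity as a dict."""
    tp, fp, fn, tn = confusion(candidates, ideal, generated, threshold)
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    # Made-up line numbers for a single task, purely for illustration.
    changed_sites = [10, 40, 80, 120]
    ideal_probes = [12, 80]
    generated_probes = [9, 41, 83]
    print(scores(changed_sites, ideal_probes, generated_probes))
```

In this made-up example, two of the four changed sites are correctly instrumented, one is instrumented unnecessarily, and one is correctly left alone. Under the assumed per-site definition, averaging such per-task scores would yield figures of the kind reported above; the authors' actual procedure may differ.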
