ResQ: Realistic Performance-Aware Query Generation

Database research and development rely heavily on realistic user workloads for benchmarking, instance optimization, migration testing, and database tuning. However, acquiring real-world SQL queries is notoriously challenging due to strict privacy regulations. While cloud database vendors have begun releasing anonymized performance traces to the research community, these traces typically provide only high-level execution statistics without the original query text or data, which is insufficient for scenarios that require actual execution. Existing tools fail to capture fine-grained performance patterns or generate runnable workloads that reproduce these public traces with both high fidelity and efficiency. To bridge this gap, we propose ResQ, a fine-grained workload synthesis system designed to generate executable SQL workloads that faithfully match the per-query execution targets and operator distributions of production traces. ResQ constructs execution-aware query graphs, instantiates them into SQL via Bayesian Optimization-driven predicate search, and explicitly models workload repetition through reuse at both exact-query and parameterized-template levels. To ensure practical scalability, ResQ combines search-space bounding with lightweight local cost models to accelerate optimization. Experiments on public cloud traces (Snowset, Redset) and a newly released industrial trace (Bendset) demonstrate that ResQ significantly outperforms state-of-the-art baselines, achieving 96.71% token savings and an 86.97% reduction in runtime, while lowering maximum Q-error by 14.8x on CPU time and 997.7x on scanned bytes, and closely matching operator composition.
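The fidelity results above are stated in terms of maximum Q-error, which the abstract does not define. As a minimal sketch, assuming the standard definition from the cardinality-estimation literature (the larger of target/achieved and achieved/target, so a perfect match scores 1), the metric could be computed as follows. The function names `q_error` and `max_q_error`, the `eps` guard against division by zero, and the sample numbers are illustrative assumptions, not part of ResQ.

```python
from typing import Iterable


def q_error(target: float, achieved: float, eps: float = 1e-9) -> float:
    """Q-error under the standard definition: max(t/a, a/t), always >= 1.

    `eps` is an assumed guard that keeps the ratio finite when either
    value is zero (e.g. a query that scans no bytes).
    """
    t = max(target, eps)
    a = max(achieved, eps)
    return max(t / a, a / t)


def max_q_error(targets: Iterable[float], achieved: Iterable[float]) -> float:
    """Worst-case Q-error over a workload of per-query (target, achieved) pairs."""
    return max(q_error(t, a) for t, a in zip(targets, achieved))


# Hypothetical per-query scanned-bytes targets from a trace, and the values
# measured when running the synthesized queries.
trace_bytes = [100.0, 10.0, 400.0]
synth_bytes = [50.0, 10.0, 800.0]
worst = max_q_error(trace_bytes, synth_bytes)
```

Because the metric is a symmetric ratio, overshooting and undershooting a target by the same factor are penalized equally, which is why it suits targets such as CPU time and scanned bytes that span several orders of magnitude.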
