Large-scale cloud security platforms must continuously query millions of structured cloud resource records distributed across thousands of tenant accounts. Broad, account-spanning queries saturate database infrastructure, producing P95 latencies exceeding 60 seconds. We identify buffer cache pressure as the dominant latency driver: in a controlled experiment, the same query, executed with the same plan, completed in 3.7 seconds when its working set was memory-resident and in 94 seconds when concurrent load had evicted those pages. No query plan optimization can address this; the only effective intervention is to reduce the number of pages each query must touch. We present the Heuristic Search Space Partitioning System (HSSPS), a query-time optimization layer that logically partitions the search space through dynamic predicate injection, without schema modification. A two-phase heuristic engine selects partition key values and scores candidate query plans before execution. A client-side page token maintains cross-partition traversal state without server-side sessions, enabling horizontal scalability. Controlled evaluation across representative query types demonstrates a 50-97% reduction in P95 latency (95-97% on high-cardinality queries), an 8-10x throughput improvement, and a 41x reduction in average active sessions. In production, rollout across live multi-tenant traffic reduced P95 latency from 61 s to 2 s over successive releases, an improvement sustained across more than 14,000 eligible queries per measurement window. The technique generalizes to any multi-tenant system in which broad queries execute against large shared databases and physical schema modification is impractical.
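To make the two central mechanisms concrete, the following is a minimal Python sketch of predicate injection combined with a stateless page token. It is an illustration under stated assumptions, not the HSSPS implementation: the `resources` table, its `account_id`, `resource_id`, and `kind` columns, the choice of `account_id` as partition key, the token layout, and the names `PageToken` and `fetch_page` are all hypothetical, and the list of partition values is simply passed in rather than chosen by a heuristic engine.

```python
import base64
import json
import sqlite3
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class PageToken:
    """Hypothetical client-held cursor: which partition, and the last key seen in it."""
    partition_index: int
    after: str  # last resource_id returned from this partition ("" = start)

    def encode(self) -> str:
        raw = json.dumps({"p": self.partition_index, "a": self.after}).encode()
        return base64.urlsafe_b64encode(raw).decode()

    @staticmethod
    def decode(token: Optional[str]) -> "PageToken":
        if token is None:
            return PageToken(0, "")
        data = json.loads(base64.urlsafe_b64decode(token.encode()))
        return PageToken(data["p"], data["a"])


def fetch_page(
    conn: sqlite3.Connection,
    base_where: str,            # caller's original filter; trusted in this sketch
    base_params: Sequence,
    partitions: Sequence[str],  # partition key values, assumed chosen upstream
    token: Optional[str],
    page_size: int,
) -> Tuple[List[tuple], Optional[str]]:
    """Return one page of rows plus the token for the next page (None when done)."""
    state = PageToken.decode(token)
    p, after = state.partition_index, state.after
    rows: List[tuple] = []
    while p < len(partitions) and len(rows) < page_size:
        need = page_size - len(rows)
        # Injected predicate: each statement is confined to one partition value.
        sql = (
            "SELECT account_id, resource_id FROM resources "
            f"WHERE ({base_where}) AND account_id = ? AND resource_id > ? "
            "ORDER BY resource_id LIMIT ?"
        )
        batch = conn.execute(sql, (*base_params, partitions[p], after, need)).fetchall()
        rows.extend(batch)
        if len(batch) < need:
            p, after = p + 1, ""   # partition exhausted, move to the next one
        else:
            after = batch[-1][1]   # resume after the last key on the next call
    next_token = PageToken(p, after).encode() if p < len(partitions) else None
    return rows, next_token
```

Because all traversal state travels in the token, any server replica can answer the next call. A caller passes `None` for the first page and feeds each returned token back in until it receives `None`. In this sketch a non-`None` token can be followed by a short or empty final page when a partition ends exactly on a page boundary.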