Real-time OLAP datastores are critical infrastructure for modern enterprises, powering interactive analytics on petabyte-scale datasets with sub-second latency requirements. As these systems become integral to service architectures, maintaining strict SLAs under failures, load spikes, and cluster changes is as important as raw performance. We present a set of resiliency mechanisms developed for Apache Pinot at LinkedIn, applicable to modern OLAP systems broadly. We introduce Query Workload Isolation (QWI), which provides workload-level CPU and memory budgeting across Pinot's broker and server tiers via fine-grained resource accounting and sub-millisecond enforcement, delivering predictable tail latency and fairness with under 1% overhead. We present Impact-Free Rebalancing for SLA-safe data movement during routine operations (e.g., upgrades, scale-out, and recovery), and Maintenance Zone Awareness to place replicas across fault domains and mitigate correlated failures. We also describe Adaptive Server Selection, which routes queries using real-time load and performance signals to avoid slow or failing nodes while preserving balanced utilization. Together, these mechanisms form a holistic resiliency framework deployed in production at LinkedIn, enabling stable query latency and high availability at scale.