1.5 Million Messages Per Second on 3 Machines: Benchmarking and Latency Optimization of Apache Pulsar at Enterprise Scale

This paper presents two independent contributions for Apache Pulsar practitioners. First, we validate 1,499,947 msg/s at 3.88 ms median publish latency on just three bare-metal Kubernetes nodes running Pulsar 4.0.8 on Java 21 with Generational ZGC, and project a hardware-driven path to 15 million msg/s on 15 machines using five independent clusters with key-based partition routing. Hardware selection, specifically dedicated NVMe journals achieving 0.02 ms fdatasync and 25 Gbps network interfaces, is the primary determinant of the throughput ceiling, rather than compute or software tuning. Second, we trace the complete latency optimization journey from 213 ms GC spikes and 13-18 ms median publish latency in production down to 3.88 ms, using root cause analysis guided by Java Flight Recorder. Three independent root causes are identified and resolved: G1GC pauses, eliminated by switching to Generational ZGC; journal fdatasync latency, reduced from 5.1 ms to 0.02 ms by dedicating NVMe drives to the journal; and a previously undocumented Linux kernel page cache writeback interaction inside BookKeeper's ForceWriteThread that degrades fdatasync from under 1 ms to 15-22 ms, even across physically separate NVMe drives that share the kernel block layer. This third finding is absent from the official Apache Pulsar and BookKeeper documentation and is relevant to all Pulsar operators experiencing unexplained P99.9 latency spikes. The combined optimizations achieve a 4.7x latency improvement at 50x higher throughput.
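
Because journal fdatasync latency is central to both contributions, it can help to measure it directly on a candidate journal device before deploying. The following is a minimal sketch of such a probe, not the benchmark harness used in this paper and not BookKeeper's own code: the function names `probe_fdatasync` and `summarize`, the 4 KiB payload, and the number of writes are all illustrative assumptions. It appends small records to a temporary file in the target directory and times each sync call.

```python
import os
import statistics
import sys
import tempfile
import time


def probe_fdatasync(directory, writes=200, payload_size=4096):
    """Return per-call sync latencies (ms) for small appends in `directory`.

    Hypothetical helper for illustration; parameters are assumptions.
    """
    # os.fdatasync is missing on some platforms; fall back to os.fsync there.
    sync = getattr(os, "fdatasync", os.fsync)
    payload = b"\0" * payload_size
    latencies_ms = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            sync(fd)  # block until the appended data is flushed
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies_ms


def summarize(latencies_ms):
    """Report median, approximate P99, and max of the measured latencies."""
    ordered = sorted(latencies_ms)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Pass the journal mount point as the first argument (assumed usage).
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    print(summarize(probe_fdatasync(target)))
```

Running the probe against the journal mount point while the ledger disks are idle, and again while they are under heavy write load, gives a rough indication of whether sync latency on the journal degrades when other devices are busy. The numbers depend on the filesystem, mount options, and device, so they are only a coarse comparison against the figures quoted above.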