Since its invention, HyperLogLog has become the standard algorithm for approximate distinct counting. Because it is space-efficient and well suited to distributed systems, it is widely used and implemented in numerous databases. This work presents UltraLogLog, which shares HyperLogLog's practical properties: it is commutative, idempotent, and mergeable, and its insert operation is fast with guaranteed constant time. At the same time, it requires 28% less space to encode the same amount of distinct-count information, which can be extracted using the maximum likelihood method. Alternatively, a simpler and faster estimator is proposed that still achieves a space reduction of 24% at an estimation speed comparable to that of HyperLogLog. In a non-distributed setting where martingale estimation can be used, UltraLogLog reduces space by 17%. Moreover, its smaller entropy and its 8-bit registers lead to better compaction with standard compression algorithms. All of this is verified by experimental results that are in perfect agreement with the theoretical analysis, which also outlines the potential for even more space-efficient data structures. A production-ready Java implementation of UltraLogLog has been released as part of the open-source Hash4j library.
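The properties named above (commutative, idempotent, mergeable, constant-time insert) can be illustrated with a toy sketch. The code below is a simplified HyperLogLog-style structure, not UltraLogLog and not the Hash4j implementation. The class name `TinyHLL`, the SHA-256-based hash, the default precision `p=10`, and the classic raw estimator with a linear-counting fallback are all assumptions made for illustration.

```python
import hashlib
import math


class TinyHLL:
    """Toy HyperLogLog-style sketch (illustrative only) with 2**p registers."""

    def __init__(self, p=10):
        self.p = p
        self.registers = [0] * (1 << p)

    def add(self, item):
        # 64-bit hash taken from SHA-256: slow, but deterministic and stdlib-only.
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)             # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        # Keeping the maximum makes inserts idempotent and order-independent.
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        # The register-wise maximum equals the sketch of the union of both inputs.
        assert self.p == other.p, "sketches must use the same precision"
        merged = TinyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        m = len(self.registers)
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)  # linear counting for small counts
        return raw


# Usage: two sketches built on separate shards, then merged.
left, right = TinyHLL(), TinyHLL()
for i in range(2500):
    left.add(i)
for i in range(2500, 5000):
    right.add(i)
union = left.merge(right)
print(round(union.estimate()))
```

Each insert touches exactly one register, which is what makes the operation constant-time. Merging by register-wise maximum is what allows partial sketches computed on different nodes to be combined without access to the original data.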