Since its invention, HyperLogLog has become the standard algorithm for approximate distinct counting. Thanks to its space efficiency and suitability for distributed systems, it is widely used and implemented in numerous databases. This work presents UltraLogLog, which shares the practical properties of HyperLogLog: it is commutative, idempotent, and mergeable, and its insert operation is fast with guaranteed constant time. At the same time, it requires 28% less space to encode the same amount of distinct-count information, which can be extracted using the maximum likelihood method. Alternatively, a simpler and faster estimator is proposed that still achieves a space reduction of 24% at an estimation speed comparable to that of HyperLogLog. In a non-distributed setting where martingale estimation can be used, UltraLogLog reduces space by 17%. Moreover, its smaller entropy and its 8-bit registers lead to better compaction with standard compression algorithms. All of this is verified by experimental results that are in perfect agreement with the theoretical analysis, which also outlines the potential for even more space-efficient data structures. A production-ready Java implementation of UltraLogLog has been released as part of the open-source Hash4j library.
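The practical properties named above (commutative, idempotent, mergeable, constant-time insert) can be illustrated with a toy HyperLogLog-style register array. The following is a minimal sketch of the classic HyperLogLog idea only. It is neither UltraLogLog nor the Hash4j API. The class name `TinyHLL`, the precision parameter `p`, and the use of BLAKE2b as the hash function are assumptions made purely for illustration.

```python
import hashlib


class TinyHLL:
    """Toy HyperLogLog-style sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=10):
        self.p = p
        self.registers = [0] * (1 << p)

    def insert(self, item):
        # Hash the item to 64 bits; BLAKE2b is an arbitrary choice for this sketch.
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        idx = h >> (64 - self.p)               # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit among the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        # A register only ever grows, so repeated or reordered inserts
        # leave the state unchanged (idempotent, commutative).
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Element-wise maximum yields the sketch of the union of both inputs.
        assert self.p == other.p
        out = TinyHLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def estimate(self):
        # Raw HyperLogLog estimator without small- or large-range corrections.
        m = len(self.registers)
        alpha = 0.7213 / (1 + 1.079 / m)  # standard constant for m >= 128
        return alpha * m * m / sum(2.0 ** -r for r in self.registers)


# Two sketches over overlapping ranges, merged in both orders.
a, b = TinyHLL(), TinyHLL()
for i in range(10000):
    a.insert(i)
for i in range(5000, 20000):
    b.insert(i)
union_ab = a.merge(b)
union_ba = b.merge(a)

# A single sketch fed the full range directly, for comparison.
direct = TinyHLL()
for i in range(20000):
    direct.insert(i)
```

Here `union_ab`, `union_ba`, and `direct` end up with identical registers, which is what makes this kind of sketch suitable for distributed aggregation: partial sketches can be built independently and combined in any order. Each insert touches exactly one register, so its cost does not depend on how many items were seen before.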