Since its invention, HyperLogLog has become the standard algorithm for approximate distinct counting. Because it is space-efficient and well suited to distributed systems, it is widely used and implemented in numerous databases. This work presents UltraLogLog, which shares the practical properties of HyperLogLog: it is commutative, idempotent, and mergeable, and its insert operation is fast with guaranteed constant time. At the same time, it requires 28% less space to encode the same amount of distinct-count information, which can be extracted using the maximum likelihood method. Alternatively, a simpler and faster estimator is proposed that still achieves a space reduction of 24% at an estimation speed comparable to that of HyperLogLog. In a non-distributed setting where martingale estimation can be used, UltraLogLog reduces space by 17%. Moreover, its smaller entropy and its 8-bit registers lead to better compaction with standard compression algorithms. All this is verified by experimental results that are in perfect agreement with the theoretical analysis, which also outlines potential for even more space-efficient data structures. A production-ready Java implementation of UltraLogLog has been released as part of the open-source Hash4j library.
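The practical properties named above (commutative, idempotent, mergeable, constant-time insert) can be illustrated with a minimal sketch. It is an assumption-laden toy: it uses a plain HyperLogLog-style maximum update per register, not UltraLogLog's actual register encoding or any of its estimators, and the class name `MaxRegisterSketch` and the choice of BLAKE2b as the hash function are hypothetical.

```python
import hashlib


class MaxRegisterSketch:
    """Toy HyperLogLog-style register array (NOT UltraLogLog's register layout)."""

    def __init__(self, p=8):
        self.p = p                         # number of index bits (assumed parameter)
        self.registers = [0] * (1 << p)    # m = 2**p registers, all initially empty

    @staticmethod
    def _hash64(item):
        # Any well-mixed 64-bit hash would do; BLAKE2b is an arbitrary choice here.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, item):
        """Constant-time insert: one hash evaluation, at most one register write."""
        h = self._hash64(item)
        index = h >> (64 - self.p)                 # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # 1-based position of the first 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        # Taking the maximum makes inserts commutative and idempotent.
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        """Element-wise maximum gives the sketch of the union of both input sets."""
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = MaxRegisterSketch(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged


# Same elements in a different order, with duplicates: identical state.
forward, backward = MaxRegisterSketch(), MaxRegisterSketch()
for x in range(1000):
    forward.add(x)
for x in reversed(range(1000)):
    backward.add(x)
    backward.add(x)   # duplicate insert has no effect
print(forward.registers == backward.registers)

# Two partial sketches merged equal the sketch built over all elements.
left, right = MaxRegisterSketch(), MaxRegisterSketch()
for x in range(500):
    left.add(x)
for x in range(500, 1000):
    right.add(x)
print(left.merge(right).registers == forward.registers)
```

Because the state depends only on the set of inserted elements, partial sketches built on different nodes can be combined in any order, which is what makes this family of sketches suitable for distributed systems.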