Since its invention, HyperLogLog has become the standard algorithm for approximate distinct counting. Thanks to its space efficiency and suitability for distributed systems, it is widely used and implemented in numerous databases. This work presents UltraLogLog, which shares the practical properties of HyperLogLog: it is commutative, idempotent, and mergeable, and it has a fast insert operation with guaranteed constant time. At the same time, it requires 28% less space to encode the same amount of distinct count information, which can be extracted using the maximum likelihood method. Alternatively, a simpler and faster estimator is proposed that still achieves a space reduction of 24% at an estimation speed comparable to that of HyperLogLog. In a non-distributed setting where martingale estimation can be used, UltraLogLog reduces space by 17%. Moreover, its smaller entropy and its 8-bit registers lead to better compaction when using standard compression algorithms. All of this is verified by experimental results that are in perfect agreement with the theoretical analysis, which also outlines potential for even more space-efficient data structures. A production-ready Java implementation of UltraLogLog has been released as part of the open-source Hash4j library.
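To make the properties named above concrete (commutative, idempotent, mergeable, constant-time insert), the following is a minimal Python sketch of a plain HyperLogLog-style register array. It is not UltraLogLog and not the Hash4j API: the class name `HllSketch`, the BLAKE2b-based 64-bit hash, the default precision `p=12`, and the textbook raw estimator with a linear-counting fallback are all assumptions chosen for illustration. UltraLogLog uses a different 8-bit register encoding and different estimators, which this sketch does not attempt to reproduce.

```python
import hashlib
import math


class HllSketch:
    """Illustrative HyperLogLog-style sketch (hypothetical, not UltraLogLog)."""

    def __init__(self, p=12):
        # Assumption: p >= 7, so that the alpha constant below applies (m >= 128).
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    @staticmethod
    def _hash64(item):
        # Any well-mixed 64-bit hash would do; BLAKE2b is used only for convenience.
        digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, item):
        # Constant-time insert: one hash, one register comparison.
        h = self._hash64(item)
        index = h >> (64 - self.p)                  # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit in `rest`; 64 - p + 1 if rest == 0.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank            # max update, hence idempotent

    def merge(self, other):
        # Element-wise maximum: commutative and associative.
        if self.p != other.p:
            raise ValueError("precision mismatch")
        merged = HllSketch(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        # Textbook raw HyperLogLog estimator with linear counting for small counts.
        alpha = 0.7213 / (1.0 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    a, b = HllSketch(), HllSketch()
    for i in range(30000):
        a.add(i)
    for i in range(20000, 50000):
        b.add(i)
    union = a.merge(b)
    # Merging in either order yields identical registers.
    print(union.registers == b.merge(a).registers)
    print(round(union.estimate()))
```

Because every update is a per-register maximum, inserting the same element twice leaves the state unchanged, and merging two sketches gives exactly the state that a single sketch would reach after seeing both streams. These are the properties that make such sketches suitable for distributed aggregation.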