Since its invention, HyperLogLog has become the standard algorithm for approximate distinct counting. Due to its space efficiency and suitability for distributed systems, it is widely used and implemented in numerous databases. This work presents UltraLogLog, which shares the same practical properties as HyperLogLog: it is commutative, idempotent, and mergeable, and it has a fast, guaranteed constant-time insert operation. At the same time, it requires 28% less space to encode the same amount of distinct count information, which can be extracted using the maximum likelihood method. Alternatively, a simpler and faster estimator is proposed, which still achieves a space reduction of 24% at an estimation speed comparable to that of HyperLogLog. In a non-distributed setting where martingale estimation can be used, UltraLogLog is able to reduce space by 17%. Moreover, its smaller entropy and its 8-bit registers lead to better compaction when using standard compression algorithms. All this is verified by experimental results that are in perfect agreement with the theoretical analysis, which also outlines potential for even more space-efficient data structures. A production-ready Java implementation of UltraLogLog has been released as part of the open-source Hash4j library.
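The practical properties named in the abstract (commutative, idempotent, mergeable, constant-time insert) can be illustrated with a minimal sketch. The code below is a toy HyperLogLog-style register array, not UltraLogLog's register layout and not the Hash4j API; the class name `ToyHLL`, its methods, and the choice of SHA-256 as the hash function are all assumptions made for illustration only. It includes no estimator; it only shows why a "keep the maximum per register" update yields these properties.

```python
import hashlib


class ToyHLL:
    """Toy HyperLogLog-style sketch (hypothetical, NOT UltraLogLog or Hash4j)."""

    def __init__(self, p=8):
        self.p = p                         # number of index bits
        self.registers = [0] * (1 << p)    # 2**p registers, all initially 0

    @staticmethod
    def _hash64(item):
        # Assumption: any well-mixed 64-bit hash would do; SHA-256 is used for convenience.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        """Constant-time insert: touches exactly one register."""
        h = self._hash64(item)
        rest_bits = 64 - self.p
        index = h >> rest_bits                    # top p bits select the register
        rest = h & ((1 << rest_bits) - 1)         # remaining bits
        # 1-based position of the first 1-bit in the remaining bits
        rank = rest_bits - rest.bit_length() + 1
        # Keeping the maximum makes the update idempotent and order-independent.
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        """Return a new sketch equal to the one built from the union of both inputs."""
        if self.p != other.p:
            raise ValueError("precision mismatch")
        merged = ToyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged


# Commutative and idempotent: insertion order and duplicates do not change the state.
a, b = ToyHLL(), ToyHLL()
for x in ["apple", "banana", "cherry"]:
    a.add(x)
for x in ["cherry", "banana", "apple", "apple"]:
    b.add(x)
same_state = a.registers == b.registers

# Mergeable: merging two partial sketches equals sketching the combined stream.
left, right, full = ToyHLL(), ToyHLL(), ToyHLL()
for i in range(500):
    left.add(i)
for i in range(250, 750):
    right.add(i)
for i in range(750):
    full.add(i)
merge_matches = left.merge(right).registers == full.registers
```

Because every update and every merge is an element-wise maximum, partial sketches built on different machines can be combined in any order, which is the property that makes this family of sketches suitable for distributed systems.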