Cardinality estimation, the problem of estimating the number of distinct elements in a stream, is a longstanding problem with applications ranging from networking to bioinformatics. HyperLogLog (HLL), the prevailing standard, has a well-known error spike in its transition region and requires 6 bits per bucket, with the data structure size scaling as B*log(log(cardinality)) for B buckets. We present DynamicLogLog (DLL), which uses a single exponent shared across all buckets, so that each bucket stores only a relative leading-zero count. This yields three benefits: (1) only 4 bits per bucket, a 33% memory reduction; (2) an early exit mask that rejects more than 99.9% of elements at high cardinality before any bucket is accessed, making DLL over 10x faster than HLL when bandwidth-constrained; and (3) a flat error profile, obtained via Dynamic Linear Counting (DLC) and a Logarithmic Hybrid Blend, that eliminates HLL's transition artifact. Squaring the maximum representable cardinality requires only a single additional bit of global state. At 2,048 buckets with 512k simulations, DLL4's hybrid estimate achieves 1.830% mean and 1.834% peak absolute error using 1,024 bytes, compared to 1.84% mean and 34.1% peak absolute error for HLL using 1,536 bytes. DLC alone achieves 1.90% mean error without correction factors. DynamicUltraLogLog (UDLL6), a fusion of DLL and UltraLogLog (ULL), achieves ULL-level accuracy at 75% of the memory. History-corrected variants (Hybrid+n) and Layered DLC (LDLC) provide further improvements using per-state correction tables and anti-phase error cancellation.
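The shared-exponent idea and the early exit mask can be illustrated with a minimal sketch. This is not the paper's implementation: the class name `SharedFloorSketch`, the helper `hash64`, the BLAKE2b hash, the choice of low hash bits for the bucket index, and the rule "raise the shared floor once no bucket stores a relative value of 0" are all assumptions made for illustration, and the DLC and hybrid estimators are omitted entirely.

```python
import hashlib

HASH_BITS = 64

def hash64(item: bytes) -> int:
    """64-bit hash of an item (BLAKE2b is an arbitrary choice for this sketch)."""
    return int.from_bytes(hashlib.blake2b(item, digest_size=8).digest(), "big")

class SharedFloorSketch:
    """Hypothetical illustration: HLL-style ranks stored relative to a shared floor."""

    def __init__(self, bucket_bits: int = 11, reg_bits: int = 4):
        self.bucket_bits = bucket_bits
        self.regs = [0] * (1 << bucket_bits)   # relative values; 0 means "at or below the floor"
        self.reg_max = (1 << reg_bits) - 1     # 15 for 4-bit buckets
        self.floor = 0                         # shared exponent (global state)
        self.exit_mask = 0                     # top `floor` bits of the hash
        self.zeros = len(self.regs)            # number of buckets still holding 0
        self.rejected = 0                      # elements dropped by the early exit
        self.saturated = False                 # True if any value was clamped to reg_max

    def add(self, item: bytes) -> None:
        h = hash64(item)
        # Early exit: a set bit among the top `floor` bits means rank <= floor,
        # so no bucket could change. No bucket is read on this path.
        if h & self.exit_mask:
            self.rejected += 1
            return
        bucket = h & (len(self.regs) - 1)      # low bits select the bucket (assumption)
        rest = h >> self.bucket_bits           # remaining bits determine the rank
        rank = (HASH_BITS - self.bucket_bits) - rest.bit_length() + 1
        rel = rank - self.floor                # >= 1 because the mask check passed
        if rel > self.reg_max:
            rel = self.reg_max                 # clamp to the bucket width; loses information
            self.saturated = True
        if rel > self.regs[bucket]:
            if self.regs[bucket] == 0:
                self.zeros -= 1
            self.regs[bucket] = rel
            while self.zeros == 0:             # every bucket is above the floor
                self._raise_floor()

    def _raise_floor(self) -> None:
        """Increment the shared floor and shift every relative value down by one."""
        self.floor += 1
        self.regs = [r - 1 for r in self.regs]
        self.zeros = self.regs.count(0)
        self.exit_mask = ((1 << self.floor) - 1) << (HASH_BITS - self.floor)

if __name__ == "__main__":
    sk = SharedFloorSketch()
    n = 200_000
    for i in range(n):
        sk.add(b"item-%d" % i)
    print("floor:", sk.floor, "rejected fraction:", sk.rejected / n)
    print("bytes at 4 bits/bucket:", len(sk.regs) * 4 // 8)
```

Under these assumptions, an element passes the mask with probability 2^-floor, so the rejected fraction grows toward 1 as the floor rises with cardinality. The sketch stores one Python integer per bucket for readability; a real implementation would pack two 4-bit values per byte to reach the 1,024-byte figure quoted for 2,048 buckets.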