Estimating the cardinality (number of distinct elements) of a large multiset is a classic problem in streaming and sketching. In this paper we study the intrinsic tradeoff between the space complexity of the sketch and its estimation error. We define a new measure of efficiency for data sketches called the Fisher-Shannon (FiSh) number $\mathcal{H}/\mathcal{I}$. It captures the tension between the limiting Shannon entropy ($\mathcal{H}$) of the sketch and its normalized Fisher information ($\mathcal{I}$), which characterizes the variance of a statistically efficient, asymptotically unbiased estimator. Our aim in introducing the FiSh-number is to build the mathematical machinery necessary to argue for precise optimality, rather than asymptotic optimality up to large constant factors. Our results are as follows. [1] We prove that all base-$q$ variants of Flajolet and Martin's PCSA sketch have FiSh-number $H_0/I_0 \approx 1.98016$, and that every base-$q$ variant of HyperLogLog has a FiSh-number worse than $H_0/I_0$ that nonetheless tends to $H_0/I_0$ in the limit as $q\rightarrow \infty$. Here $H_0,I_0$ are precisely defined constants. [2] We describe a sketch called Fishmonger that is based on a smoothed, entropy-compressed variant of PCSA with a different estimator function. Fishmonger processes a multiset over $[U]$ such that at all times, with high probability, its space is $(1+o(1))(H_0/I_0)m \approx 1.98m$ bits and its standard error is $1/\sqrt{m}$. For example, to achieve a 1% standard error, one needs a little more than 19,800 bits, or $\approx 2.42$ kilobytes. [3] Finally, we give circumstantial evidence that $H_0/I_0$ is the optimum FiSh-number of mergeable sketches for Cardinality Estimation. We define a natural subset of mergeable sketches called linearizable sketches and prove that no member of this class can beat $H_0/I_0$. The popular mergeable sketches are, in fact, also linearizable.
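The space figure quoted for a 1% standard error can be checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the constant $H_0/I_0 \approx 1.98016$ and the relation "standard error $= 1/\sqrt{m}$" from the abstract, drops the $(1+o(1))$ factor, and assumes that "kilobyte" means 1024 bytes. The names `FISH_NUMBER` and `sketch_bits` are hypothetical and are not taken from the paper.

```python
# Constant quoted in the abstract: H_0/I_0 ~ 1.98016 (bits of space per unit of m).
FISH_NUMBER = 1.98016


def sketch_bits(std_error: float, fish_number: float = FISH_NUMBER) -> float:
    """Leading-order space, in bits, for a target standard error.

    Uses std_error = 1/sqrt(m), so m = 1/std_error**2, and space ~ fish_number * m.
    The (1 + o(1)) factor from the abstract is ignored (an assumption of this sketch).
    """
    if not 0.0 < std_error < 1.0:
        raise ValueError("std_error must lie strictly between 0 and 1")
    m = 1.0 / std_error ** 2
    return fish_number * m


if __name__ == "__main__":
    bits = sketch_bits(0.01)      # 1% standard error, i.e. m = 10,000
    kib = bits / 8 / 1024         # bits -> bytes -> units of 1024 bytes
    print(f"{bits:.1f} bits, about {kib:.2f} KiB")
```

Under these assumptions the result is roughly 19,800 bits, consistent with the abstract's "a little more than 19,800 bits". Because the space scales as $1/\text{error}^2$, halving the target error quadruples the required space.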