In this paper, we address the Distinct Elements estimation problem in the context of streaming algorithms. The problem is to estimate the number of distinct elements in a given data stream $\mathcal{A} = (a_1, a_2, \ldots, a_m)$, where $a_i \in \{1, 2, \ldots, n\}$. Over the past four decades, the Distinct Elements problem has received considerable attention, both theoretical and empirical, leading to the development of space-optimal algorithms. A recent sampling-based algorithm proposed by Chakraborty et al. [11] has attracted significant interest, including from Donald E. Knuth, who wrote an article on the topic [6] and named the algorithm CVM. We thoroughly examine these algorithms (referred to as CVM1, CVM2, DonD, and DonD' in [6]). We first unify them under the name cutoff-based algorithms, and then analyze their approximation quality and bias.
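For orientation, the following is a minimal sketch of a sampling-based estimator in the spirit of the algorithm of Chakraborty et al. [11]. It is an illustration under stated assumptions, not the exact CVM1, CVM2, DonD, or DonD' variants analyzed in this paper: the function name `cvm_estimate`, the free parameter `thresh` (the buffer size), and the `seed` argument are our own choices for the example.

```python
import random


def cvm_estimate(stream, thresh, seed=None):
    """Estimate the number of distinct elements in `stream`.

    Illustrative sketch only: `thresh` (the buffer size) is treated as a
    free parameter here rather than derived from accuracy parameters.
    """
    rng = random.Random(seed)
    p = 1.0          # current sampling probability
    sample = set()   # elements currently retained in the buffer

    for a in stream:
        # Forget any earlier decision about `a`, then re-sample it with
        # probability p, so only its latest occurrence matters.
        sample.discard(a)
        if rng.random() < p:
            sample.add(a)

        if len(sample) == thresh:
            # Buffer full: keep each element independently with
            # probability 1/2 and halve the sampling probability.
            sample = {x for x in sample if rng.random() < 0.5}
            p /= 2
            if len(sample) == thresh:
                # Every element survived the halving step, which happens
                # with probability 2**-thresh. This sketch simply gives up.
                raise RuntimeError("buffer still full after halving")

    # Each distinct element is in the buffer with probability p.
    return len(sample) / p
```

When the buffer never fills, the sampling probability stays at 1 and the estimate is exact; for example, `cvm_estimate([1, 2, 3, 2, 1, 4], thresh=100)` returns `4.0`. With a smaller buffer the output is a randomized estimate whose accuracy depends on `thresh`.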