In this paper, we address the Distinct Elements estimation problem in the streaming setting: estimate the number of distinct elements in a data stream $\mathcal{A} = (a_1, a_2, \ldots, a_m)$, where each $a_i \in \{1, 2, \ldots, n\}$. Over the past four decades, the problem has received considerable theoretical and empirical attention, leading to the development of space-optimal algorithms. A recent sampling-based algorithm proposed by Chakraborty et al. [11] has attracted significant interest, including from Donald E. Knuth, who wrote an article on the topic [6] and named the algorithm CVM. We thoroughly examine these algorithms (referred to as CVM1 and CVM2 in [11], and as DonD and DonD' in [6]). We first unify them under the name cutoff-based algorithms, and then analyze their approximation quality and bias.
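To make the problem statement concrete, the following is a minimal Python sketch of a sampling-based estimator in the spirit of the cutoff-based family. It is a simplified illustration, not the exact CVM1, CVM2, DonD, or DonD' procedures from [11] and [6], whose details are not reproduced here. The names `exact_distinct`, `cutoff_estimate`, `thresh`, and `rng` are hypothetical, and the choice to keep halving until the buffer drops below the cutoff is an assumption of this sketch.

```python
import random


def exact_distinct(stream):
    """Exact number of distinct elements; needs memory proportional to the answer."""
    return len(set(stream))


def cutoff_estimate(stream, thresh, rng=None):
    """Estimate the number of distinct elements with a buffer smaller than `thresh`.

    Illustrative sketch only (assumed variant, not the published pseudocode):
    the latest occurrence of each element is kept with probability p; whenever
    the buffer reaches `thresh` elements, each buffered element is dropped
    with probability 1/2 and p is halved.
    """
    if thresh < 1:
        raise ValueError("thresh must be at least 1")
    rng = rng or random.Random()
    p = 1.0
    buffer = set()
    for a in stream:
        # Forget any earlier decision about `a`; only its latest occurrence counts.
        buffer.discard(a)
        if rng.random() < p:
            buffer.add(a)
        # Cutoff reached: subsample the buffer and halve the sampling rate.
        while len(buffer) >= thresh:
            buffer = {x for x in buffer if rng.random() < 0.5}
            p /= 2.0
    # Each surviving element stands for roughly 1/p distinct elements.
    return len(buffer) / p


if __name__ == "__main__":
    data = [random.randrange(1, 5001) for _ in range(20000)]
    print("exact:   ", exact_distinct(data))
    print("estimate:", cutoff_estimate(data, thresh=500, rng=random.Random(1)))
```

When the stream has fewer than `thresh` distinct elements, the cutoff is never reached, `p` stays at 1, and the sketch returns the exact count. For longer streams the output is a random estimate whose accuracy depends on `thresh`.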