In the traditional turnstile model of data streams, a state vector $x=(x_1,x_2,\ldots,x_n)$ is updated by a stream of pairs $(i,k)$, where $i\in [n]$ and $k\in \Z$. Upon receiving $(i,k)$, the update $x_i\gets x_i + k$ is applied. A distinct count algorithm in the turnstile model makes one pass over the stream and then estimates $\norm{x}_0 = |\{i\in[n]\mid x_i\neq 0\}|$ (also known as $L_0$ or the Hamming norm). In this paper, we define a finite-field version of the turnstile model. Let $F$ be any finite field. In the $F$-turnstile model, every entry satisfies $x_i\in F$ for $i\in [n]$, every update $(i,k)$ has $k\in F$, and the update $x_i\gets x_i+k$ is computed in the field $F$. A distinct count algorithm in the $F$-turnstile model makes one pass over the stream and estimates $\norm{x}_{0;F} = |\{i\in[n]\mid x_i\neq 0_F\}|$. We present a simple distinct count algorithm, called $F$-\pcsa{}, in the $F$-turnstile model for any finite field $F$. The new $F$-\pcsa{} algorithm uses $m\log(n)\log (|F|)$ bits of memory and estimates $\norm{x}_{0;F}$ with $O(\frac{1}{\sqrt{m}})$ relative error, where the hidden constant depends on the order of the field. $F$-\pcsa{} is straightforward to implement, and different choices of $F$ yield several real-world applications. Most notably, it makes distinct count with deletions as simple as distinct count without deletions.
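To make the model definition concrete, the following minimal sketch tracks the state vector exactly for the special case $F=\Z_p$ with $p$ prime and reports $\norm{x}_{0;F}$. It is an exact baseline that stores every nonzero entry, not the $F$-\pcsa{} sketch itself; the class name `PrimeFieldTurnstile`, its methods, and the restriction to prime fields are assumptions made for illustration only.

```python
class PrimeFieldTurnstile:
    """Exact (non-sketching) state for the F-turnstile model with F = Z_p.

    Hypothetical helper for illustration; memory grows with the number of
    nonzero entries, unlike a small-space sketch.
    """

    def __init__(self, p):
        # Assumption: p is prime, so the integers modulo p form a field.
        self.p = p
        self.x = {}  # sparse state vector: index -> nonzero value in Z_p

    def update(self, i, k):
        """Apply x_i <- x_i + k, with the addition carried out in Z_p."""
        v = (self.x.get(i, 0) + k) % self.p
        if v == 0:
            # Entry returned to 0_F: it no longer counts toward the norm.
            self.x.pop(i, None)
        else:
            self.x[i] = v

    def l0(self):
        """Return the number of indices i with x_i != 0_F."""
        return len(self.x)


# Example stream over Z_3.
state = PrimeFieldTurnstile(3)
for i, k in [(1, 1), (2, 2), (1, 2), (5, 1), (2, 2)]:
    state.update(i, k)
print(state.l0())
```

In this stream, $x_1$ receives $1+2=0$ in $\Z_3$ and drops out, while $x_2 = 2+2 = 1$ and $x_5 = 1$ remain nonzero, so the reported count is 2. A deletion is simply an update by the additive inverse of an earlier value, which is why insertions and deletions are handled by the same code path.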