Many streaming algorithms provide only a high-probability relative approximation. These two relaxations, approximation and randomization, seem necessary: for many streaming problems, both must be employed simultaneously to avoid an exponentially larger (and often trivial) space complexity. A common drawback of these randomized approximate algorithms is that independent executions on the same input may produce different outputs, depending on their random coins. Pseudo-deterministic algorithms combat this issue: for every input, they output the same ``canonical'' solution with high probability. We consider perhaps the most basic problem in data streams, counting the number of items in a stream of length at most $n$. Morris's counter [CACM, 1978] is a randomized approximation algorithm for this problem that uses $O(\log\log n)$ bits of space, for every fixed approximation factor (greater than $1$). Goldwasser, Grossman, Mohanty and Woodruff [ITCS 2020] asked whether pseudo-deterministic approximation algorithms can match this space complexity. Our main result answers their question negatively, showing that such algorithms must use $\Omega(\sqrt{\log n / \log\log n})$ bits of space. Our approach is based on a problem that we call Shift Finding, which may be of independent interest. In this problem, one has query access to a shifted version of a known string $F\in\{0,1\}^{3n}$, which is guaranteed to start with $n$ zeros and end with $n$ ones, and the goal is to find the unknown shift using a small number of queries. We provide an algorithm for this problem that uses $O(\sqrt{n})$ queries. It remains open whether $\mathrm{poly}(\log n)$ queries suffice; if so, then our techniques immediately imply a nearly-tight $\Omega(\log n/\log\log n)$ space bound for pseudo-deterministic approximate counting.
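To make the drawback concrete, the following is a minimal sketch of the textbook base-2 variant of Morris's counter, not the construction analyzed in this work. The names `MorrisCounter` and `count_stream` are hypothetical and chosen only for illustration. The counter stores an exponent $X$ and increments it with probability $2^{-X}$, so that $2^{X}-1$ is an unbiased estimate of the stream length; since $X$ is typically about $\log_2 n$, storing it takes roughly $\log\log n$ bits. The base-2 variant gives only a coarse estimate; finer approximation factors are usually obtained with a base closer to $1$, which this sketch does not implement.

```python
import random


class MorrisCounter:
    """Textbook base-2 Morris counter (illustrative sketch only)."""

    def __init__(self, rng=None):
        # Only the exponent is stored; it is typically about log2(n),
        # so it fits in roughly log log n bits.
        self.exponent = 0
        # The source of random coins (assumption: any random.Random works).
        self.rng = rng if rng is not None else random.Random()

    def increment(self):
        # Raise the exponent with probability 2^(-exponent).
        if self.rng.random() < 2.0 ** (-self.exponent):
            self.exponent += 1

    def estimate(self):
        # E[2^exponent] = n + 1 after n increments, so this is unbiased.
        return 2 ** self.exponent - 1


def count_stream(n, seed):
    """Feed n items to a fresh counter whose coins are fixed by `seed`."""
    counter = MorrisCounter(random.Random(seed))
    for _ in range(n):
        counter.increment()
    return counter.estimate()


if __name__ == "__main__":
    # Independent executions on the same input (same n, different coins)
    # may report different estimates, so the output is not canonical.
    for seed in range(5):
        print(seed, count_stream(1000, seed))
```

Running the script on the same stream length with different seeds may print different estimates, which is exactly the behavior a pseudo-deterministic algorithm is required to avoid.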