The goal of the trace reconstruction problem is to recover a string $x\in\{0,1\}^n$ given many independent {\em traces} of $x$, where a trace is a subsequence obtained by deleting bits of $x$ independently with some given probability $p\in [0,1)$. A recent result of Chase (STOC 2021) shows how $x$ can be determined (in exponential time) from $\exp(\widetilde{O}(n^{1/5}))$ traces. This is the state-of-the-art result on the sample complexity of trace reconstruction. In this paper we consider two kinds of algorithms for the trace reconstruction problem. Our first, and technically more involved, result shows that any $k$-mer-based algorithm for trace reconstruction must use $\exp(\Omega(n^{1/5}))$ traces, under the assumption that the estimator requires $\mathrm{poly}(2^k, 1/\varepsilon)$ traces, thus establishing the optimality of this number of traces for this class of algorithms. The analysis behind this result also shows that the analysis technique used by Chase (STOC 2021) is essentially tight, and hence new techniques are needed in order to improve the worst-case upper bound. Our second, simpler, result considers the performance of the Maximum Likelihood Estimator (MLE), which picks the source string that is most likely to have generated the observed samples (traces). We show that the MLE algorithm uses a nearly optimal number of traces, i.e., within a factor of $n$ of the number of samples needed by an optimal algorithm, and show that this factor of $n$ loss may be necessary under general ``model estimation'' settings.
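As an illustration of the definitions above, the following is a minimal sketch of the deletion channel and of a brute-force MLE for very small $n$. It is not the paper's algorithm or analysis: the function names (`sample_trace`, `count_embeddings`, `log_likelihood`, `mle`) are hypothetical, and the search over all $2^n$ candidates is an assumption made only to keep the example short. The probability of observing a trace $y$ from $x$ is the number of ways $y$ embeds in $x$ as a subsequence, times $p^{n-|y|}(1-p)^{|y|}$; since that second factor is the same for every candidate of length $n$, the sketch compares embedding counts only.

```python
import itertools
import math
import random


def sample_trace(x, p, rng):
    """Pass x through the deletion channel: drop each bit independently with probability p."""
    return "".join(b for b in x if rng.random() >= p)


def count_embeddings(x, y):
    """Count the ways y occurs as a (not necessarily contiguous) subsequence of x."""
    # ways[j] = number of embeddings of y[:j] in the prefix of x processed so far
    ways = [1] + [0] * len(y)
    for xb in x:
        # iterate j downwards so each bit of x is matched at most once per embedding
        for j in range(len(y), 0, -1):
            if y[j - 1] == xb:
                ways[j] += ways[j - 1]
    return ways[len(y)]


def log_likelihood(x, traces):
    """Log-likelihood of the traces given x, up to a term that does not depend on x."""
    total = 0.0
    for y in traces:
        c = count_embeddings(x, y)
        if c == 0:
            return float("-inf")  # x cannot produce this trace at all
        total += math.log(c)
    return total


def mle(traces, n):
    """Brute-force MLE over all 2^n binary strings (exponential time; tiny n only)."""
    candidates = ("".join(bits) for bits in itertools.product("01", repeat=n))
    return max(candidates, key=lambda x: log_likelihood(x, traces))


if __name__ == "__main__":
    rng = random.Random(0)
    x = "0110100"
    traces = [sample_trace(x, 0.2, rng) for _ in range(200)]
    print("source  :", x)
    print("estimate:", mle(traces, len(x)))
```

With a single undeleted trace the estimate is forced: `mle(["0110"], 4)` returns `"0110"`, because no other string of length 4 contains `"0110"` as a subsequence. With noisy traces the estimate may or may not match the source, depending on the number of traces and on $p$.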