The goal of the trace reconstruction problem is to recover a string $x\in\{0,1\}^n$ given many independent {\em traces} of $x$, where a trace is a subsequence obtained by deleting each bit of $x$ independently with some given probability $p\in [0,1)$. A recent result of Chase (STOC 2021) shows that $x$ can be determined (in exponential time) from $\exp(\widetilde{O}(n^{1/5}))$ traces, which is the state-of-the-art bound on the sample complexity of trace reconstruction. In this paper we consider two kinds of algorithms for the trace reconstruction problem. Our first, and technically more involved, result shows that any $k$-mer-based algorithm for trace reconstruction must use $\exp(\Omega(n^{1/5}))$ traces, under the assumption that the estimator requires $\mathrm{poly}(2^k, 1/\varepsilon)$ traces, thus establishing the optimality of this number of traces for such algorithms. The proof also shows that the analysis technique used by Chase (STOC 2021) is essentially tight, and hence new techniques are needed in order to improve the worst-case upper bound. Our second, simpler result concerns the performance of the Maximum Likelihood Estimator (MLE), which picks the source string that maximizes the likelihood of generating the observed samples (traces). We show that the MLE algorithm uses a nearly optimal number of traces, \ie, up to a factor of $n$ in the number of samples needed for an optimal algorithm, and show that this factor of $n$ loss may be necessary under general ``model estimation'' settings.
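To make the two central objects concrete, the following is a minimal illustrative sketch, not the paper's algorithm, of the deletion channel and a brute-force MLE over all $2^n$ candidate strings. The function names (`sample_trace`, `count_embeddings`, `mle`) are hypothetical and chosen for illustration. The sketch relies on the fact that a trace $y$ of $x$ has probability $N(x,y)\,(1-p)^{|y|}p^{\,n-|y|}$, where $N(x,y)$ is the number of ways $y$ occurs as a subsequence of $x$. For candidates of fixed length $n$ the factor depending on $p$ is the same for every candidate, so it suffices to maximize the product of the counts $N(x,y)$ over the observed traces.

```python
import itertools
import math
import random


def sample_trace(x, p, rng):
    """Deletion channel: drop each bit of x independently with probability p."""
    return "".join(b for b in x if rng.random() >= p)


def count_embeddings(x, y):
    """Number of ways y occurs as a subsequence of x (standard DP)."""
    ways = [1] + [0] * len(y)  # ways[j] = embeddings of y[:j] in the prefix of x seen so far
    for a in x:
        # Go right to left so each character of x is used at most once per embedding.
        for j in range(len(y), 0, -1):
            if y[j - 1] == a:
                ways[j] += ways[j - 1]
    return ways[len(y)]


def mle(traces, n):
    """Brute-force MLE over {0,1}^n (exponential time; illustration only).

    The p-dependent factor of the likelihood is identical for all length-n
    candidates, so only the embedding counts matter.
    """
    best, best_score = None, -math.inf
    for bits in itertools.product("01", repeat=n):
        x = "".join(bits)
        score = 0.0
        for y in traces:
            c = count_embeddings(x, y)
            if c == 0:  # x cannot produce this trace
                score = -math.inf
                break
            score += math.log(c)
        if score > best_score:
            best, best_score = x, score
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    x = "01101001"
    traces = [sample_trace(x, 0.2, rng) for _ in range(200)]
    print(mle(traces, len(x)))
```

The enumeration over all $2^n$ candidates is only feasible for very small $n$; the sketch is meant to fix the definitions, not to suggest a practical procedure.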