In this work, we study the problem of approximating the distance to subsequence-freeness in the sample-based distribution-free model. For a given subsequence (word) $w = w_1 \dots w_k$, a sequence (text) $T = t_1 \dots t_n$ is said to contain $w$ if there exist indices $1 \leq i_1 < \dots < i_k \leq n$ such that $t_{i_{j}} = w_j$ for every $1 \leq j \leq k$. Otherwise, $T$ is $w$-free. Ron and Rosin (ACM TOCT 2022) showed that the number of samples both necessary and sufficient for one-sided error testing of subsequence-freeness in the sample-based distribution-free model is $\Theta(k/\epsilon)$. Denoting by $\Delta(T,w,p)$ the distance of $T$ to $w$-freeness under a distribution $p :[n]\to [0,1]$, we are interested in obtaining an estimate $\widehat{\Delta}$, such that $|\widehat{\Delta} - \Delta(T,w,p)| \leq \delta$ with probability at least $2/3$, for a given distance parameter $\delta$. Our main result is an algorithm whose sample complexity is $\tilde{O}(k^2/\delta^2)$. We first present an algorithm that works when the underlying distribution $p$ is uniform, and then show how it can be modified to work for any (unknown) distribution $p$. We also show that a quadratic dependence on $1/\delta$ is necessary.
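To make the containment definition concrete, the following minimal sketch checks whether a text $T$ contains a word $w$ as a subsequence by greedily matching $w_1, \dots, w_k$ from left to right. The function names `contains_subsequence` and `is_w_free` are illustrative assumptions, not names from the paper. The sketch reads the whole text, so it illustrates only the definition, not the sample-based estimation algorithm.

```python
from typing import Hashable, Sequence


def contains_subsequence(text: Sequence[Hashable], word: Sequence[Hashable]) -> bool:
    """Return True if `word` occurs in `text` as a (not necessarily contiguous) subsequence.

    Greedy scan: match each symbol of `word` at the earliest possible position.
    If any increasing index sequence i_1 < ... < i_k exists, the greedy one does too.
    """
    matched = 0  # number of symbols of `word` matched so far
    for symbol in text:
        if matched < len(word) and symbol == word[matched]:
            matched += 1
    return matched == len(word)


def is_w_free(text: Sequence[Hashable], word: Sequence[Hashable]) -> bool:
    """Return True if `text` is w-free, i.e. does not contain `word` as a subsequence."""
    return not contains_subsequence(text, word)


if __name__ == "__main__":
    print(contains_subsequence("abcab", "aab"))
    print(is_w_free("abc", "ba"))
```

The scan takes $O(n)$ time for a text of length $n$. Estimating $\Delta(T,w,p)$ is harder, because the algorithm sees only sampled positions of $T$ and not the full text.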