Sample-based distance-approximation for subsequence-freeness

In this work, we study the problem of approximating the distance to subsequence-freeness in the sample-based distribution-free model. For a given subsequence (word) $w = w_1 \dots w_k$, a sequence (text) $T = t_1 \dots t_n$ is said to contain $w$ if there exist indices $1 \leq i_1 < \dots < i_k \leq n$ such that $t_{i_{j}} = w_j$ for every $1 \leq j \leq k$. Otherwise, $T$ is $w$-free. Ron and Rosin (ACM TOCT 2022) showed that the number of samples both necessary and sufficient for one-sided error testing of subsequence-freeness in the sample-based distribution-free model is $\Theta(k/\epsilon)$. Denoting by $\Delta(T,w,p)$ the distance of $T$ to $w$-freeness under a distribution $p :[n]\to [0,1]$, we are interested in obtaining an estimate $\widehat{\Delta}$, such that $|\widehat{\Delta} - \Delta(T,w,p)| \leq \delta$ with probability at least $2/3$, for a given distance parameter $\delta$. Our main result is an algorithm whose sample complexity is $\tilde{O}(k^2/\delta^2)$. We first present an algorithm that works when the underlying distribution $p$ is uniform, and then show how it can be modified to work for any (unknown) distribution $p$. We also show that a quadratic dependence on $1/\delta$ is necessary.
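
The containment definition above can be made concrete with a minimal sketch. It covers only the exact check of whether $T$ contains $w$, using a greedy left-to-right scan; it is not the sample-based distance-approximation algorithm of the paper. The function names `contains_subsequence` and `is_free` are hypothetical and chosen for illustration.

```python
from typing import Sequence


def contains_subsequence(text: Sequence, word: Sequence) -> bool:
    """Return True if `word` occurs in `text` as a subsequence.

    The symbols of `word` must appear in `text` in order, but not
    necessarily contiguously. Greedy matching suffices: taking the
    earliest possible match for each w_j never rules out a later match.
    """
    matched = 0  # number of leading symbols of `word` matched so far
    for symbol in text:
        if matched < len(word) and symbol == word[matched]:
            matched += 1
    return matched == len(word)


def is_free(text: Sequence, word: Sequence) -> bool:
    """Return True if `text` is `word`-free, i.e. does not contain `word`."""
    return not contains_subsequence(text, word)
```

For example, `contains_subsequence("abcab", "aab")` holds via indices $1 < 4 < 5$, while `"bbaa"` is `"ab"`-free because no `b` follows an `a`. The scan reads all $n$ positions of the text, which is the cost the sample-based model is designed to avoid.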

