Let $W$ be a string of length $n$ over an alphabet $\Sigma$, $k$ be a positive integer, and $\mathcal{S}$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\mathrm{ED}}$ such that: (i) no string of $\mathcal{S}$ occurs in $X_{\mathrm{ED}}$; (ii) the order of all other length-$k$ substrings over $\Sigma$ (and thus the frequency) is the same in $W$ and in $X_{\mathrm{ED}}$; and (iii) $X_{\mathrm{ED}}$ has minimal edit distance to $W$. When $W$ represents an individual's data and $\mathcal{S}$ represents a set of confidential patterns, the ETFS problem asks for transforming $W$ to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in $\mathcal{O}(n^2k)$ time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in $\mathcal{O}(n^{2-\delta})$ time, for any $\delta>0$, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an $\mathcal{O}(n^2\log^2k)$-time algorithm to solve ETFS; and (ii) an $\mathcal{O}(n^2\log^2n)$-time algorithm to solve AETFS, a generalization of ETFS in which the elements of $\mathcal{S}$ can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
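To make conditions (i)-(iii) concrete, the following minimal sketch checks whether a candidate string satisfies (i) and (ii) and computes its edit distance to $W$. It is only a verifier, not the algorithm of this paper or of the cited ones. It rests on assumptions not stated above: the candidate may contain a separator letter `#` outside $\Sigma$, a length-$k$ substring containing `#` does not count as a substring over $\Sigma$, and edit distance uses unit costs. The function names `kmers_over_sigma`, `is_feasible`, and `edit_distance` are hypothetical.

```python
def kmers_over_sigma(s, k, sep="#"):
    """Length-k substrings of s, left to right, skipping any that contain sep."""
    return [s[i:i + k] for i in range(len(s) - k + 1) if sep not in s[i:i + k]]


def is_feasible(w, x, k, sensitive, sep="#"):
    """Check conditions (i) and (ii) for candidate x (assumed semantics, see lead-in)."""
    # (i) no sensitive pattern occurs in x
    if any(pattern in x for pattern in sensitive):
        return False
    # (ii) the sequence of non-sensitive length-k substrings is preserved
    kept_in_w = [u for u in kmers_over_sigma(w, k, sep) if u not in sensitive]
    return kept_in_w == kmers_over_sigma(x, k, sep)


def edit_distance(a, b):
    """Textbook quadratic-time dynamic program with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    w, k, sensitive = "aabaa", 2, {"ba"}
    x = "aab#aa"  # hypothetical candidate: one separator breaks the occurrence of "ba"
    print(is_feasible(w, x, k, sensitive), edit_distance(w, x))
```

In this toy instance, $W$ itself is infeasible because it contains `ba`, so any feasible string is at edit distance at least 1 from $W$. Under the assumptions above, the candidate `aab#aa` attains that bound and therefore also satisfies (iii).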