Pattern Masking for Dictionary Matching

In the Pattern Masking for Dictionary Matching (PMDM) problem, we are given a dictionary $\mathcal{D}$ of $d$ strings, each of length $\ell$, a query string $q$ of length $\ell$, and a positive integer $z$, and we are asked to compute a smallest set $K\subseteq\{1,\ldots,\ell\}$ such that, if $q[i]$ is replaced by a wildcard for all $i\in K$, then $q$ matches at least $z$ strings from $\mathcal{D}$. The PMDM problem lies at the heart of two important applications featured in large-scale real-world systems: record linkage of databases that contain sensitive information, and query term dropping. In both applications, solving PMDM makes it possible to provide data utility guarantees, which existing approaches do not offer. We first show, through a reduction from the well-known $k$-Clique problem, that a decision version of the PMDM problem is NP-complete, even for strings over a binary alphabet. We present a data structure for PMDM that answers queries over $\mathcal{D}$ in time $\mathcal{O}(2^{\ell/2}(2^{\ell/2}+\tau)\ell)$ and requires space $\mathcal{O}(2^{\ell}d^2/\tau^2+2^{\ell/2}d)$, for any parameter $\tau\in[1,d]$. We also approach the problem from a more practical perspective. We show an $\mathcal{O}((d\ell)^{k/3}+d\ell)$-time and $\mathcal{O}(d\ell)$-space algorithm for PMDM when $k=|K|=\mathcal{O}(1)$. We generalize our exact algorithm to mask multiple query strings simultaneously. We complement our results by showing a two-way polynomial-time reduction between PMDM and the Minimum Union problem [Chlamt\'{a}\v{c} et al., SODA 2017]. This gives a polynomial-time $\mathcal{O}(d^{1/4+\epsilon})$-approximation algorithm for PMDM, which is tight under plausible complexity conjectures.
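
To make the problem statement concrete, the following is a minimal brute-force sketch of PMDM in Python. It is not the data structure or the exact algorithm described above; it simply tries masks $K$ in order of increasing size and returns the first one under which $q$ matches at least $z$ dictionary strings. The function names `matches` and `pmdm_bruteforce` and the sample dictionary are illustrative assumptions. Positions are 1-indexed to agree with $K\subseteq\{1,\ldots,\ell\}$.

```python
from itertools import combinations


def matches(q, s, masked):
    """Return True if s agrees with q at every position not in `masked`.

    Positions are 1-indexed; a masked position acts as a wildcard.
    """
    return all(i in masked or a == b
               for i, (a, b) in enumerate(zip(q, s), start=1))


def pmdm_bruteforce(dictionary, q, z):
    """Return a smallest set K of positions to mask in q, or None.

    Illustrative exhaustive search (an assumption of this sketch, not the
    method from the abstract): it may inspect all 2^ell masks, each checked
    against all d strings in O(ell) time.
    """
    ell = len(q)
    if any(len(s) != ell for s in dictionary):
        raise ValueError("all dictionary strings must have the same length as q")
    for k in range(ell + 1):                      # smallest masks first
        for K in combinations(range(1, ell + 1), k):
            masked = set(K)
            if sum(matches(q, s, masked) for s in dictionary) >= z:
                return masked
    return None                                   # fewer than z strings exist


# Hypothetical example over a binary alphabet.
D = ["0110", "0101", "1100", "0111"]
print(pmdm_bruteforce(D, "0110", 3))  # {3, 4}: "01**" matches 0110, 0101, 0111
```

Because the search is exponential in $\ell$, this sketch is only practical for short strings, which is in line with the NP-completeness of the decision version.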
