Given two strings $S$ and $P$, the Episode Matching problem is to find the shortest substring of $S$ that contains $P$ as a subsequence. The best known upper bound for this problem is $\tilde O(nm)$ by Das et al. (1997), where $n$ and $m$ are the lengths of $S$ and $P$, respectively. Although the problem is well studied and has many applications in data mining, this bound has never been improved. In this paper we show why this is the case by proving that no $O((nm)^{1-\epsilon})$-time algorithm exists, even for binary strings, unless the Strong Exponential Time Hypothesis (SETH) is false. We then consider the indexing version of the problem, where $S$ is preprocessed into a data structure for answering episode matching queries $P$. We show that for any $\tau$, there is a data structure using $O(n+\left(\frac{n}{\tau}\right)^k)$ space that answers episode matching queries for any $P$ of length $k$ in $O(k\cdot \tau \cdot \log \log n)$ time. We complement this upper bound with an almost matching lower bound, showing that any data structure that answers episode matching queries for patterns of length $k$ in time $O(n^\delta)$ must use $\Omega(n^{k-k\delta-o(1)})$ space, unless the Strong $k$-Set Disjointness Conjecture is false. Finally, for the special case of $k=2$, we present a faster construction of the data structure using fast min-plus multiplication of bounded integer matrices.
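To make the problem statement concrete, the following is a minimal sketch of a textbook $O(nm)$-time, $O(m)$-space dynamic program for Episode Matching. It is an illustration only and is not claimed to be the algorithm of Das et al.; the function name `episode_match` and the half-open `(start, end)` return convention are assumptions made for this example.

```python
from typing import Optional, Tuple


def episode_match(s: str, p: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) such that s[start:end] is a shortest substring
    of s containing p as a subsequence, or None if no such substring exists.

    Hypothetical helper for illustration; runs in O(len(s) * len(p)) time.
    """
    m = len(p)
    if m == 0:
        return (0, 0)  # the empty pattern is matched by the empty substring

    # latest[j] = largest start index t such that p[:j] is a subsequence of
    # s[t:i+1] for the current position i, or -1 if no such t exists.
    latest = [-1] * (m + 1)
    best: Optional[Tuple[int, int]] = None

    for i, c in enumerate(s):
        latest[0] = i  # the empty prefix can start at the current position
        # Go from long prefixes to short ones so that latest[j - 1] still
        # holds the value from before position i was processed.
        for j in range(m, 0, -1):
            if p[j - 1] == c:
                latest[j] = latest[j - 1]
        if latest[m] != -1:
            start = latest[m]
            if best is None or (i + 1 - start) < (best[1] - best[0]):
                best = (start, i + 1)
    return best
```

For example, with $S = \texttt{axbxxab}$ and $P = \texttt{ab}$, both `axb` and the final `ab` contain $P$ as a subsequence, and the sketch reports the shorter one.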