Subsequence Matching and LCS with Segment Number Constraints

The longest common subsequence (LCS) problem is a fundamental problem in string processing that has been the subject of numerous algorithmic studies, extensions, and applications. A sequence $u_1, \ldots, u_f$ of $f$ strings is said to be an ($f$-)segmentation of a string $P$ if $P = u_1 \cdots u_f$. Li et al. [BIBM 2022] proposed a new variant of the LCS problem for given strings $T_1, T_2$ and an integer $f$, which we call the segmental LCS problem (SegLCS): find (the length of) a longest string $P$ that has an $f$-segmentation which can be embedded into both $T_1$ and $T_2$. Li et al. [IJTCS-FAW 2024] gave a dynamic programming solution that solves SegLCS in $O(fn_1n_2)$ time with $O(fn_1 + n_2)$ space, where $n_1 = |T_1|$, $n_2 = |T_2|$, and $n_1 \le n_2$. Recently, Banerjee et al. [ESA 2024] presented an algorithm which, for a constant $f \geq 3$, solves SegLCS in $\tilde{O}((n_1n_2)^{1-(1/3)^{f-2}})$ time. In this paper, we deal with SegLCS as well as the problem of segmental subsequence pattern matching, SegE, which asks whether a pattern $P$ of length $m$ has an $f$-segmentation that can be embedded into a text $T$ of length $n$. When $f = 1$, this is equivalent to substring matching, and when $f = |P|$, this is equivalent to subsequence matching. Our focus in this article is the case of general values of $f$, and our main contributions are threefold: (1) a conditional lower bound showing that, under the strong exponential-time hypothesis (SETH), SegE cannot be solved in $O((mn)^{1-\epsilon})$ time for any constant $\epsilon > 0$; (2) an $O(mn)$-time algorithm for SegE; (3) an $O(fn_2(n_1 - \ell+1))$-time algorithm for SegLCS, where $\ell$ is the solution length.
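To make the SegE problem concrete, the following is a minimal sketch of a textbook-style dynamic program that decides it in $O(mn)$ time. It is an illustration only and is not claimed to be the algorithm of this paper. It assumes that segments may be empty, so that "has an $f$-segmentation" means "can be cut into at most $f$ non-empty pieces", and that the pieces must occur in $T$ as non-overlapping substrings in left-to-right order. The function names `min_segments` and `seg_embeds` are hypothetical.

```python
def min_segments(pattern: str, text: str) -> float:
    """Fewest non-empty segments into which `pattern` can be cut so that
    the segments occur in `text` as non-overlapping substrings, in order.
    Returns float('inf') if `pattern` is not a subsequence of `text`."""
    INF = float("inf")
    n = len(text)
    # prev_end[j]: fewest segments embedding the current pattern prefix in
    #              text[:j] with its last character matched to text[j-1].
    # prev_any[j]: fewest segments embedding that prefix anywhere in text[:j].
    prev_end = [INF] * (n + 1)
    prev_any = [0] * (n + 1)  # the empty prefix needs zero segments
    for i in range(1, len(pattern) + 1):
        cur_end = [INF] * (n + 1)
        cur_any = [INF] * (n + 1)
        for j in range(1, n + 1):
            if pattern[i - 1] == text[j - 1]:
                # Either extend the segment ending at text[j-2],
                # or open a new segment at text[j-1].
                cur_end[j] = min(prev_end[j - 1], prev_any[j - 1] + 1)
            cur_any[j] = min(cur_any[j - 1], cur_end[j])
        prev_end, prev_any = cur_end, cur_any
    return prev_any[n]


def seg_embeds(pattern: str, text: str, f: int) -> bool:
    """Decide SegE under the assumptions stated above."""
    return min_segments(pattern, text) <= f
```

For example, `seg_embeds("abcd", "abxcd", 1)` is false because `abcd` is not a substring of `abxcd`, while `seg_embeds("abcd", "abxcd", 2)` is true via the segmentation `ab`, `cd`. The two rolling rows keep the working space at $O(n)$.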
