Elastic-Degenerate String Comparison

An elastic-degenerate (ED) string $T$ is a sequence of $n$ sets $T[1],\ldots,T[n]$ containing $m$ strings in total whose cumulative length is $N$. We call $n$, $m$, and $N$ the length, the cardinality and the size of $T$, respectively. The language of $T$ is defined as $L(T)=\{S_1 \cdots S_n\,:\,S_i \in T[i]$ for all $i\in[1,n]\}$. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem.

For two ED strings $T_1$ and $T_2$ of lengths $n_1$ and $n_2$, cardinalities $m_1$ and $m_2$, and sizes $N_1$ and $N_2$, respectively, we show the following:

- There is no $O((N_1N_2)^{1-\epsilon})$-time algorithm, for any constant $\epsilon>0$, for EDSI even when $T_1$ and $T_2$ are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false.
- There is no combinatorial $O((N_1+N_2)^{1.2-\epsilon}f(n_1,n_2))$-time algorithm, for any constant $\epsilon>0$ and any function $f$, for EDSI even when $T_1$ and $T_2$ are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false.
- An $O(N_1\log N_1\log n_1+N_2\log N_2\log n_2)$-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when $T_1$ and $T_2$ are given in a compact representation, we show that the problem is NP-complete.
- An $O(N_1m_2+N_2m_1)$-time algorithm for EDSI.
- An $\tilde{O}(N_1^{\omega-1}n_2+N_2^{\omega-1}n_1)$-time algorithm for EDSI, where $\omega$ is the exponent of matrix multiplication; the $\tilde{O}$ notation suppresses factors that are polylogarithmic in the input size.
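To make the definitions of $L(T)$ and EDSI concrete, the following Python sketch may help. It is an illustration only and is not one of the algorithms announced above: it assumes an ED string is stored as a list of sets of Python strings, and the names `language` and `edsi_search` are hypothetical. `language` enumerates $L(T)$ directly from the definition, which is exponential in $n$. `edsi_search` instead runs a breadth-first search over states that record how many sets of each ED string have been used and by which suffix one side is currently ahead, so the number of states stays polynomial in the input size.

```python
from collections import deque
from itertools import product


def language(T):
    """Enumerate L(T) explicitly (exponential in len(T); tiny inputs only)."""
    return {"".join(parts) for parts in product(*T)}


def edsi_search(T1, T2):
    """Decide whether L(T1) and L(T2) intersect (illustrative sketch only).

    A state (i, j, ahead, rest) means: i sets of T1 and j sets of T2 have been
    used, the two chosen prefixes agree, and side `ahead` (0 for T1, 1 for T2)
    is longer than the other side by the overhang `rest`.
    """
    T = (T1, T2)
    start = (0, 0, 0, "")
    seen = {start}
    queue = deque([start])
    while queue:
        i, j, ahead, rest = queue.popleft()
        if i == len(T1) and j == len(T2) and not rest:
            return True  # both ED strings fully used and the prefixes coincide
        used = (i, j)
        # The side that is behind must move; on a tie, either side may move.
        movers = (0, 1) if not rest else (1 - ahead,)
        for side in movers:
            if used[side] == len(T[side]):
                continue  # this side has no sets left
            for s in T[side][used[side]]:
                if rest.startswith(s):
                    # s fits inside the overhang; the same side stays ahead.
                    new_ahead, new_rest = ahead, rest[len(s):]
                elif s.startswith(rest):
                    # s covers the overhang; the moving side is now ahead.
                    new_ahead, new_rest = side, s[len(rest):]
                else:
                    continue  # s and the overhang disagree on some letter
                if not new_rest:
                    new_ahead = 0  # canonical form for a tie
                state = (i + (side == 0), j + (side == 1), new_ahead, new_rest)
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False
```

For example, with `T1 = [{"a", "ab"}, {"c", "bc"}]` and `T2 = [{"ab"}, {"b", ""}, {"c"}]`, the languages are `{"ac", "abc", "abbc"}` and `{"abc", "abbc"}`, so `edsi_search(T1, T2)` returns `True`; against `[{"b"}, {"c"}]`, whose language is `{"bc"}`, it returns `False`. On small inputs the search can be cross-checked against `bool(language(T1) & language(T2))`.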
