Word Break is a prototypical factorization problem in string processing: Given a word $w$ of length $N$ and a dictionary $\mathcal{D} = \{d_1, d_2, \ldots, d_{K}\}$ of $K$ strings, determine whether we can partition $w$ into words from $\mathcal{D}$. We propose the first algorithm that solves the Word Break problem when the input text $w$ is compressed by a straight-line program (SLP). Specifically, we show that, given the string $w$ represented using an SLP of size $g$, we can solve the Word Break problem in $\mathcal{O}(g \cdot m^{\omega} + M)$ time, where $m = \max_{i=1}^{K} |d_i|$, $M = \sum_{i=1}^{K} |d_i|$, and $\omega \geq 2$ is the matrix multiplication exponent. We obtain our algorithm as a simple corollary of a more general result: We show that in $\mathcal{O}(g \cdot m^{\omega} + M)$ time, we can index the input text $w$ so that solving the Word Break problem for any of its substrings takes $\mathcal{O}(m^2 \log N)$ time (independent of the substring length). Our second contribution is a lower bound: We prove that, unless the Combinatorial $k$-Clique Conjecture fails, there is no combinatorial algorithm for Word Break on SLP-compressed strings running in $\mathcal{O}(g \cdot m^{2-\epsilon} + M)$ time for any $\epsilon > 0$.
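For orientation, the problem statement can be illustrated with the classical dynamic program on the uncompressed word. This is only a baseline sketch and not the SLP-based algorithm described above: it scans all $N$ positions of $w$ explicitly, whereas the compressed algorithm works on the grammar of size $g$. The function name `word_break` and the use of a Python set for $\mathcal{D}$ are assumptions made for this illustration.

```python
def word_break(w: str, dictionary: set[str]) -> bool:
    """Return True if w can be partitioned into words from dictionary.

    Baseline DP on the uncompressed word (illustrative only):
    reachable[i] is True when the prefix w[:i] can be factorized.
    """
    n = len(w)
    # m = length of the longest dictionary word; longer pieces never match.
    m = max((len(d) for d in dictionary), default=0)

    reachable = [False] * (n + 1)
    reachable[0] = True  # the empty prefix is trivially factorized

    for end in range(1, n + 1):
        # Try every candidate last word w[end - length:end] of length <= m.
        for length in range(1, min(m, end) + 1):
            if reachable[end - length] and w[end - length:end] in dictionary:
                reachable[end] = True
                break
    return reachable[n]


if __name__ == "__main__":
    print(word_break("abab", {"ab"}))  # factorizes as ab|ab
    print(word_break("aba", {"ab"}))   # trailing "a" is not in the dictionary
```

The sketch performs at most $N \cdot m$ substring lookups, so its cost grows with the uncompressed length $N$. That dependence is what the compressed setting avoids by working with the grammar size $g$ instead.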