Edit distance is an important measure of string similarity. It is the minimum number of insertions, deletions and substitutions needed to transform a string $x$ into a string $y$. In this paper we design an almost linear-size sketching scheme for computing edit distance up to a given threshold $k$. The scheme consists of two algorithms, a sketching algorithm and a recovery algorithm. The sketching algorithm depends on the parameter $k$; it takes as input a string $x$ and a public random string $\rho$ and computes a sketch $sk_{\rho}(x;k)$, which is a digested version of $x$. The recovery algorithm is given two sketches $sk_{\rho}(x;k)$ and $sk_{\rho}(y;k)$, as well as the public random string $\rho$ used to create them. With high probability, if the edit distance $ED(x,y)$ between $x$ and $y$ is at most $k$, it outputs $ED(x,y)$ together with an optimal sequence of edit operations that transforms $x$ into $y$, and if $ED(x,y) > k$, it outputs LARGE. The size of the sketch output by the sketching algorithm on input $x$ is $k{2^{O(\sqrt{\log(n)\log\log(n)})}}$ (where $n$ is an upper bound on the length of $x$). The sketching and recovery algorithms both run in time polynomial in $n$. The dependence of the sketch size on $k$ is information-theoretically optimal and improves over the quadratic dependence on $k$ in the schemes of Kociumaka, Porat and Starikovskaya (FOCS'2021), and Bhattacharya and Kouck\'y (STOC'2023).
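To make the output contract of the recovery algorithm concrete, the following minimal Python sketch computes the thresholded edit distance directly from the two full strings: it returns $ED(x,y)$ when that value is at most $k$ and the string `"LARGE"` otherwise. This is not the sketching scheme described above; it is the classical banded dynamic program, which works on $x$ and $y$ themselves rather than on sketches, and it does not recover the edit sequence. The function name `bounded_edit_distance` is a hypothetical name chosen for illustration.

```python
def bounded_edit_distance(x: str, y: str, k: int):
    """Return ED(x, y) if it is at most k, otherwise the string "LARGE".

    Illustrative baseline only: a banded dynamic program over the full
    strings, not the sketch-based recovery algorithm.
    """
    n, m = len(x), len(y)
    # A length difference above k already forces more than k edits.
    if abs(n - m) > k:
        return "LARGE"

    inf = k + 1  # every value above k is capped here
    # prev[j] = edit distance between x[:i-1] and y[:j], capped at inf.
    prev = [j if j <= k else inf for j in range(m + 1)]

    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        # Only cells with |i - j| <= k can hold a value of at most k.
        lo = max(0, i - k)
        hi = min(m, i + k)
        if lo == 0:
            cur[0] = i  # delete the first i characters of x (i <= k here)
        for j in range(max(1, lo), hi + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            best = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # deletion from x
                cur[j - 1] + 1,      # insertion into x
            )
            cur[j] = min(best, inf)
        prev = cur

    return prev[m] if prev[m] <= k else "LARGE"
```

For example, `bounded_edit_distance("kitten", "sitting", 3)` returns `3`, while the same call with `k = 2` returns `"LARGE"`. The band restricts each row to at most $2k+1$ cells, so this baseline runs in $O(nk)$ time but needs access to both strings in full.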