Weighted Edit Distance Computation: Strings, Trees and Dyck

Given two strings of length $n$ over an alphabet $\Sigma$ and an upper bound $k$ on their edit distance, the algorithm of Myers (Algorithmica'86) and Landau and Vishkin (JCSS'88) computes the unweighted string edit distance in $\mathcal{O}(n+k^2)$ time. To date, it remains the fastest algorithm for exact edit distance computation, and it is optimal under the Strong Exponential Time Hypothesis (STOC'15). Over the years, this result has inspired many developments, including fast approximation algorithms for string edit distance as well as similar $\tilde{\mathcal{O}}(n+\mathrm{poly}(k))$-time algorithms for generalizations to tree and Dyck edit distances. Surprisingly, all these results hold only for unweighted instances. While unweighted edit distance is theoretically fundamental, almost all real-world applications require weighted edit distance, where different weights are assigned to different edit operations and may vary with the characters being edited. Given a weight function $w: (\Sigma \cup \{\varepsilon\}) \times (\Sigma \cup \{\varepsilon\}) \rightarrow \mathbb{R}_{\ge 0}$ (such that $w(a,a)=0$ and $w(a,b)\ge 1$ for all $a,b\in \Sigma \cup \{\varepsilon\}$ with $a\ne b$), the goal is to find an alignment that minimizes the total weight of edits. Except for the vanilla $\mathcal{O}(n^2)$-time dynamic-programming algorithm and its almost trivial $\mathcal{O}(nk)$-time implementation, none of the aforementioned developments on the unweighted edit distance apply to the weighted variant. In this paper, we propose the first $\mathcal{O}(n+\mathrm{poly}(k))$-time algorithm that computes weighted string edit distance exactly, thus bridging a fundamental gap between our understanding of unweighted and weighted edit distance. We then generalize this result to weighted tree and Dyck edit distances, which yields a deterministic algorithm that improves upon previous work on unweighted tree edit distance.
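For concreteness, the $\mathcal{O}(nk)$-time baseline mentioned above can be sketched as a banded dynamic program. This is only an illustration of that baseline, not the algorithm proposed in this paper. It relies on the stated assumption $w(a,b)\ge 1$ for $a\ne b$: an alignment of total weight at most $k$ uses at most $k$ insertions and deletions, so it never leaves the diagonal band $|i-j|\le k$. The function names, the use of the empty string for $\varepsilon$, and the example weights are all hypothetical choices made for this sketch.

```python
import math

def weighted_edit_distance_banded(x, y, w, k):
    """Return the weighted edit distance of x and y if it is at most k, else None.

    w(a, b) is the cost of substituting a by b; w(a, "") is the cost of
    deleting a; w("", b) is the cost of inserting b ("" stands for epsilon).
    Assumes w(a, a) == 0 and w(a, b) >= 1 whenever a != b.
    """
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return None  # at least |n - m| indels are needed, each of weight >= 1
    inf = math.inf
    # prev maps column j to D[i-1][j], stored only for cells with |i - j| <= k.
    prev = {0: 0.0}
    for j in range(1, min(m, k) + 1):
        prev[j] = prev[j - 1] + w("", y[j - 1])
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            best = inf
            if j - 1 in cur:                 # insert y[j-1]
                best = min(best, cur[j - 1] + w("", y[j - 1]))
            if j in prev:                    # delete x[i-1]
                best = min(best, prev[j] + w(x[i - 1], ""))
            if j > 0 and j - 1 in prev:      # substitute or match
                best = min(best, prev[j - 1] + w(x[i - 1], y[j - 1]))
            cur[j] = best
        prev = cur
    d = prev.get(m, inf)
    return d if d <= k else None

def example_w(a, b):
    """Hypothetical weights: substitutions cost 1, insertions/deletions cost 1.5."""
    if a == b:
        return 0.0
    if a == "" or b == "":
        return 1.5
    return 1.0

print(weighted_edit_distance_banded("kitten", "sitting", example_w, 4))  # 3.5
print(weighted_edit_distance_banded("kitten", "sitting", example_w, 3))  # None
```

Each row touches at most $2k+1$ cells, which gives the $\mathcal{O}(nk)$ bound. The dictionary-based rows are chosen for readability; an array indexed by diagonal offset would be the more economical layout.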
