Sequence comparison is a basic task for capturing similarities and differences between two or more sequences of symbols, with countless applications, for example in computational biology. An alignment is a way to compare sequences, where a given scoring function determines the degree of similarity between them. Many scoring functions are obtained from scoring matrices. However, not all scoring matrices induce scoring functions that are distances, since the induced scoring function is not necessarily a metric. In this work we establish necessary and sufficient conditions for scoring matrices to induce each of the properties of a metric in weighted edit distances. For a subset of scoring matrices that induce normalized edit distances, we also characterize each corresponding class of scoring matrices. Furthermore, we define an extended edit distance, which takes into account any set of edit operations that transforms one sequence into another, regardless of whether a usual alignment exists to represent them, and we describe a criterion for finding a sequence of edit operations of minimum weight. Similarly, we determine the class of scoring matrices that induces extended edit distances for each of the properties of a metric.
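To make concrete the claim that a scoring matrix need not induce a metric, the following is a minimal Python sketch of an alignment-based weighted edit distance. It is an illustration under stated assumptions, not the construction used in this work: the names `weighted_edit_distance`, `unit_matrix`, and `GAP`, the dictionary encoding of the scoring matrix, and the example costs are all hypothetical, and the standard dynamic program below assumes the minimum is taken over alignments only (it does not model the extended edit distance).

```python
from itertools import product

GAP = "-"  # hypothetical symbol standing for a space (insertion or deletion)


def weighted_edit_distance(s, t, score):
    """Minimum total score over all alignments of s and t.

    score[(a, b)] is the cost of aligning symbol a with symbol b,
    where exactly one of the two may be GAP.
    """
    n, m = len(s), len(t)
    # d[i][j] = minimum cost of aligning the prefixes s[:i] and t[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + score[(s[i - 1], GAP)]  # delete s[i-1]
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + score[(GAP, t[j - 1])]  # insert t[j-1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + score[(s[i - 1], t[j - 1])],  # match/substitute
                d[i - 1][j] + score[(s[i - 1], GAP)],           # delete
                d[i][j - 1] + score[(GAP, t[j - 1])],           # insert
            )
    return d[n][m]


def unit_matrix(alphabet):
    """Scoring matrix with cost 0 for equal symbols and 1 otherwise."""
    symbols = list(alphabet) + [GAP]
    return {
        (a, b): 0 if a == b else 1
        for a, b in product(symbols, repeat=2)
        if (a, b) != (GAP, GAP)
    }


# Unit costs give the classical (Levenshtein) edit distance.
unit = unit_matrix("ab")

# Assumed example: make deleting "a" cost 3 while inserting "a" still costs 1.
skewed = dict(unit)
skewed[("a", GAP)] = 3

print(weighted_edit_distance("ab", "ba", unit))
print(weighted_edit_distance("a", "", skewed), weighted_edit_distance("", "a", skewed))
```

With the skewed matrix, transforming "a" into the empty sequence costs 3 while the reverse costs 1, so the induced function is not symmetric and therefore not a metric, even though it is still computed by the same alignment recurrence.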