Maximal Common Subsequences (MCSs) between two strings X and Y are subsequences of both X and Y that are maximal under inclusion. MCSs relax and generalize the well-known and widely used concept of Longest Common Subsequences (LCSs), which can be seen as MCSs of maximum length. While the number of both LCSs and MCSs can be exponential in the length of the strings, LCSs have long been exploited for string and text analysis, as simple compact representations of all LCSs between two strings, built via dynamic programming or automata, have been known since the '70s. MCSs appear to have a more challenging structure: even listing them efficiently was an open problem until recently. Solving it narrowed the complexity difference between the two problems, but the gap remained significant. In this paper we close the complexity gap: we show how to build a DAG of polynomial size, in polynomial time, which allows for efficient operations on the set of all MCSs such as enumeration in Constant Amortized Time per solution (CAT), counting, and random access to the i-th element (i.e., rank and select operations). Besides improving known algorithmic results, this work paves the way for new sequence analysis methods based on MCSs.
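To make the definition of an MCS concrete, the following is a minimal brute-force sketch in Python. It is not the polynomial-size DAG construction described in the abstract: it enumerates every common subsequence in exponential time and keeps those that are maximal under inclusion, so it is suitable only for very short strings. The function names are illustrative assumptions and do not come from the paper.

```python
from itertools import combinations


def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting characters."""
    it = iter(t)
    # `ch in it` advances the iterator, so matches are found left to right.
    return all(ch in it for ch in s)


def common_subsequences(x, y):
    """All distinct common subsequences of x and y (exponential time)."""
    subs = set()
    for k in range(len(x) + 1):
        for idx in combinations(range(len(x)), k):
            s = "".join(x[i] for i in idx)
            if is_subsequence(s, y):
                subs.add(s)
    return subs


def maximal_common_subsequences(x, y):
    """Common subsequences that are not a proper subsequence of another one."""
    common = common_subsequences(x, y)
    return sorted(
        s for s in common
        if not any(s != t and is_subsequence(s, t) for t in common)
    )


if __name__ == "__main__":
    mcss = maximal_common_subsequences("ABC", "CAB")
    print(mcss)  # ['AB', 'C']
    # The LCSs are exactly the MCSs of maximum length.
    longest = max(len(s) for s in mcss)
    print([s for s in mcss if len(s) == longest])  # ['AB']
```

For X = "ABC" and Y = "CAB", the string "C" is an MCS although it is shorter than the LCS "AB": no character can be inserted into "C" while keeping it a subsequence of both strings. This illustrates how MCSs relax LCSs.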