The Shortest Common Superstring (SCS) problem asks for the shortest string that contains each of a given set of strings as a substring. Its reverse-complement variant, the Shortest Common Superstring problem with Reverse Complements (SCS-RC), arises naturally in bioinformatics: for each input string, either the string itself or its reverse complement must appear as a substring of the superstring. The well-known MGREEDY algorithm for the standard SCS constructs a superstring by first computing an optimal cycle cover on the overlap graph and then concatenating the strings corresponding to the cycles; its refined variant, TGREEDY, further improves the approximation ratio. Although the original 4- and 3-approximation bounds of these algorithms have been successively improved for the standard SCS, no such progress has been made for the reverse-complement setting. A previous study extended MGREEDY to SCS-RC with a 4-approximation guarantee and briefly suggested that extending TGREEDY to the reverse-complement setting could achieve a 3-approximation. In this work, we strengthen these results by proving that the extensions of MGREEDY and TGREEDY to the reverse-complement setting achieve approximation ratios of 3.75 and 2.875, respectively. Our analysis extends the classical proofs for the standard SCS to handle the bidirectional overlaps introduced by reverse complements. These results are the first formal improvement in approximation guarantees for SCS-RC, and 2.875 is currently the best known approximation ratio for the problem.
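To make the SCS-RC feasibility condition concrete, the following minimal Python sketch shows the reverse-complement operation, the pairwise overlap used as edge weights in an overlap graph, and a check that a candidate string is a valid SCS-RC superstring. The function names, the restriction to the DNA alphabet `ACGT`, and the brute-force overlap computation are illustrative assumptions, not taken from the paper; the sketch does not implement MGREEDY or TGREEDY.

```python
# Illustrative sketch only: names and the ACGT alphabet are assumptions,
# not part of the paper. This is not MGREEDY or TGREEDY.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s: str) -> str:
    """Swap A<->T and C<->G, then reverse the string."""
    return s.translate(_COMPLEMENT)[::-1]


def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is a prefix of b,
    restricted to lengths shorter than both strings."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def is_rc_superstring(candidate: str, strings: list[str]) -> bool:
    """True if each input string, or its reverse complement,
    occurs as a substring of candidate."""
    return all(
        s in candidate or reverse_complement(s) in candidate
        for s in strings
    )
```

For instance, under these assumptions `ACGTCA` covers the inputs `ACGT` and `TGAC`: the first occurs directly, and the second is covered through its reverse complement `GTCA`. In the standard SCS setting `TGAC` itself would have to appear, which is where the two problems differ.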