Scalable and Interpretable Identification of Minimal Undesignable RNA Structure Motifs with Rotational Invariance

RNA design aims to find a sequence that folds with the highest probability into a designated target structure. However, certain structures are undesignable, meaning that no sequence can fold into the target structure under the default (Turner) RNA folding model. Understanding the specific local structures (i.e., "motifs") that contribute to undesignability is crucial for refining RNA folding models and determining the limits of RNA designability. Despite its importance, this problem has received very little attention, and previous efforts are neither scalable nor interpretable. We develop a new theoretical framework for motif (un-)designability, and design scalable and interpretable algorithms to identify minimal undesignable motifs within a given RNA secondary structure. Our approach establishes motif undesignability by searching for rival motifs, rather than by exhaustively enumerating all (partial) sequences that could potentially fold into the motif. Furthermore, we exploit rotational invariance in RNA structures to detect, group, and reuse equivalent motifs and to construct a database of unique minimal undesignable motifs. To this end, we propose a loop-pair graph representation for motifs and a recursive graph isomorphism algorithm for motif equivalence. Our algorithms successfully identify 24 unique minimal undesignable motifs among 18 undesignable puzzles from the Eterna100 benchmark. Surprisingly, we also find over 350 unique minimal undesignable motifs and 663 undesignable native structures in the ArchiveII dataset, drawn from a diverse set of RNA families. Our source code is available at https://github.com/shanry/RNA-Undesign and our web server is available at http://linearfold.org/motifs.
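To make the idea of rotational equivalence concrete, the following is a minimal sketch and not the loop-pair graph algorithm of this work. It assumes a motif can be modeled as a tree whose nodes are loops and whose edges are base pairs, with each node's neighbors stored in cyclic order. The function names (encode, canonical_form, equivalent), the adjacency-dictionary layout, and the loop labels ("M", "H3", "H4", "S") are all hypothetical choices made for illustration. The sketch computes a brute-force canonical form by trying every root and every rotation at the root, which is simple but quadratic in the number of loops.

```python
def encode(adj, label, node, parent=None):
    """Encode the tree seen from `node` as nested tuples.

    adj[node] lists the neighbors of `node` in cyclic order (an assumption
    of this sketch); label[node] is a string describing the loop.
    """
    nbrs = adj[node]
    if parent is None:
        # Root: there is no reference edge, so try every rotation of the
        # cyclic neighbor order and keep the smallest encoding.
        options = []
        for s in range(max(1, len(nbrs))):
            order = nbrs[s:] + nbrs[:s]
            children = tuple(encode(adj, label, c, node) for c in order)
            options.append((label[node], children))
        return min(options)
    # Non-root: the edge back to the parent fixes where the cycle starts.
    i = nbrs.index(parent)
    order = nbrs[i + 1:] + nbrs[:i]
    return (label[node], tuple(encode(adj, label, c, node) for c in order))


def canonical_form(adj, label):
    """Smallest encoding over all choices of root (brute force)."""
    return min(encode(adj, label, root) for root in adj)


def equivalent(motif_a, motif_b):
    """True if the two (adj, label) motifs differ only by rotation."""
    return canonical_form(*motif_a) == canonical_form(*motif_b)


# Hypothetical labels for the three outer loops, shared by all examples.
labels = {"m": "M", "a": "H3", "b": "H4", "c": "S"}
leaves = {"a": ["m"], "b": ["m"], "c": ["m"]}

# Motif A: a multiloop whose neighbors appear in cyclic order H3, H4, S.
motif_a = ({"m": ["a", "b", "c"], **leaves}, labels)
# Motif B: the same loops listed from a different starting point (a rotation).
motif_b = ({"m": ["b", "c", "a"], **leaves}, labels)
# Motif C: the mirror-image order H3, S, H4, which is not a rotation of A.
motif_c = ({"m": ["a", "c", "b"], **leaves}, labels)
```

Under these assumptions, equivalent(motif_a, motif_b) holds because the two neighbor lists differ only in where the cycle starts, while equivalent(motif_a, motif_c) does not hold because reversing the cyclic order is not a rotation.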
