While artificial intelligence has made remarkable strides in revealing the relationship between the primary sequence and tertiary structure of biological macromolecules, designing RNA sequences that fold into a specified tertiary structure remains challenging. Existing protein design approaches have thoroughly explored structure-to-sequence dependencies in proteins, but RNA design is still hindered by structural complexity and data scarcity. Moreover, even though proteins and RNA share similar structural components, directly transplanting protein design methodologies into RNA design fails to achieve satisfactory results. In this study, we aim to systematically construct a data-driven RNA design pipeline. We build a large, well-curated benchmark dataset and design a comprehensive structural modeling approach to represent complex RNA tertiary structure. More importantly, we propose a hierarchical, data-efficient representation learning framework that learns structural representations through contrastive learning at both the cluster level and the sample level, so as to fully leverage the limited data. By constraining the data representations to a limited hyperspherical space, the intrinsic relationships between data points can be imposed explicitly. In addition, we incorporate extracted secondary structures with base pairs as prior knowledge to facilitate the RNA design process. Extensive experiments demonstrate the effectiveness of the proposed method, which provides a reliable baseline for future RNA design tasks. The source code and benchmark dataset will be released publicly.
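To make the idea of two-level contrastive learning on a hypersphere more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the loss, so the InfoNCE-style objective, the use of cluster centroids as prototypes, and the names `info_nce`, `centroid`, `hierarchical_loss`, `temperature`, and `weight` are all assumptions made for illustration. Embeddings are L2-normalized so that they lie on the unit hypersphere, and similarity is measured by the dot product.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (assumed loss, not from the source)."""
    a = l2_normalize(anchor)
    logits = [dot(a, l2_normalize(positive)) / temperature]
    logits += [dot(a, l2_normalize(n)) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_denominator - logits[0]


def centroid(vectors):
    """Mean of the normalized vectors, projected back onto the hypersphere."""
    normed = [l2_normalize(v) for v in vectors]
    dim = len(normed[0])
    mean = [sum(v[i] for v in normed) / len(normed) for i in range(dim)]
    return l2_normalize(mean)


def hierarchical_loss(embeddings, views, cluster_ids, temperature=0.1, weight=0.5):
    """Weighted sum of sample-level and cluster-level terms, averaged over samples."""
    clusters = {}
    for emb, cid in zip(embeddings, cluster_ids):
        clusters.setdefault(cid, []).append(emb)
    prototypes = {cid: centroid(members) for cid, members in clusters.items()}

    total = 0.0
    for i, (emb, cid) in enumerate(zip(embeddings, cluster_ids)):
        # Sample level: the positive is a second view of the same structure,
        # the negatives are the views of all other structures.
        sample_negatives = [v for j, v in enumerate(views) if j != i]
        sample_term = info_nce(emb, views[i], sample_negatives, temperature)
        # Cluster level: the positive is the sample's own cluster prototype,
        # the negatives are the prototypes of the other clusters.
        cluster_negatives = [p for c, p in prototypes.items() if c != cid]
        cluster_term = info_nce(emb, prototypes[cid], cluster_negatives, temperature)
        total += weight * sample_term + (1.0 - weight) * cluster_term
    return total / len(embeddings)


# Toy 2-D embeddings for four hypothetical structures in two clusters.
embeddings = [[1.0, 0.1], [0.9, 0.2], [-0.1, 1.0], [0.0, 0.9]]
views = [[1.0, 0.0], [0.8, 0.3], [0.1, 1.0], [-0.2, 1.0]]
cluster_ids = [0, 0, 1, 1]
loss = hierarchical_loss(embeddings, views, cluster_ids)
```

In this sketch, a cluster assignment that matches the geometry of the embeddings yields a lower loss than a mismatched one, which is the sense in which the cluster-level term imposes relationships between data points. A real pipeline would compute the embeddings with a structure encoder and backpropagate through the loss in an automatic-differentiation framework.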