This work presents RNAdiffusion, a latent diffusion model for generating and optimizing discrete RNA sequences of variable lengths. RNA is a key intermediary between DNA and protein, and its high sequence diversity and complex three-dimensional structures support a wide range of functions. We use pretrained BERT-type models to encode raw RNA sequences into token-level, biologically meaningful representations. A Query Transformer compresses these representations into a set of fixed-length latent vectors, and an autoregressive decoder is trained to reconstruct RNA sequences from these latent variables. We then develop a continuous diffusion model within this latent space. To enable optimization, we integrate the gradients of reward models, which serve as surrogates for RNA functional properties, into the backward diffusion process, thereby generating RNAs with high reward scores. Empirical results confirm that RNAdiffusion generates non-coding RNAs that align with natural distributions across various biological metrics. Further, we fine-tune the diffusion model on mRNA 5' untranslated regions (5'-UTRs) and optimize sequences for high translation efficiency. Our guided diffusion model effectively generates diverse 5'-UTRs with high Mean Ribosome Loading (MRL) and Translation Efficiency (TE), outperforming baselines in balancing the trade-off between reward and structural stability. Our findings hold potential for advancing RNA sequence-function research and therapeutic RNA design.
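The reward-guided backward diffusion step described in the abstract can be illustrated with a small sketch. The abstract does not give the exact guidance rule, so the code below assumes a classifier-guidance-style update in which the reward gradient, scaled by the step variance, is added to the denoising mean. Every name here is hypothetical: `toy_eps_model` stands in for the trained latent denoiser (it is the exact noise predictor only when clean latents follow a standard normal), `toy_reward` stands in for a learned reward model such as an MRL or TE predictor, and the linear noise schedule and `guidance_scale` value are illustrative choices rather than the settings used by RNAdiffusion.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an illustrative assumption)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_eps_model(z, t, alpha_bars):
    """Hypothetical stand-in for the trained denoiser.

    If clean latents are N(0, I), then z_t is also N(0, I) and the
    exact noise prediction is sqrt(1 - alpha_bar_t) * z_t.
    """
    return np.sqrt(1.0 - alpha_bars[t]) * z


def toy_reward(z, target):
    """Hypothetical reward: higher when the latent is close to `target`."""
    return -0.5 * np.sum((z - target) ** 2, axis=-1)


def toy_reward_grad(z, target):
    """Analytic gradient of `toy_reward` with respect to z."""
    return target - z


def guided_sample(num_samples, dim, target, guidance_scale,
                  num_steps=50, seed=0):
    """Run the backward diffusion process with reward-gradient guidance."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    z = rng.standard_normal((num_samples, dim))  # start from pure noise
    for t in reversed(range(num_steps)):
        eps = toy_eps_model(z, t, alpha_bars)
        # Standard DDPM posterior mean computed from the predicted noise.
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # Guidance: shift the mean along the reward gradient,
        # scaled by the step variance beta_t (assumed update rule).
        mean = mean + guidance_scale * betas[t] * toy_reward_grad(z, target)
        # No noise is added on the final step.
        noise = rng.standard_normal(z.shape) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise
    return z


if __name__ == "__main__":
    target = np.ones(8)
    for scale in (0.0, 5.0):
        z = guided_sample(512, 8, target, guidance_scale=scale)
        print(f"guidance_scale={scale}: mean reward = {toy_reward(z, target).mean():.3f}")
```

In this toy setting, setting `guidance_scale` to zero recovers unguided sampling from the prior, while a positive value pulls the latents toward the high-reward region. In the full pipeline the resulting latent vectors would then be passed to the autoregressive decoder to obtain RNA sequences, a step that is omitted here.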