Retrieval-Augmented Generation for Predicting Cellular Responses to Gene Perturbation

Predicting how cells respond to genetic perturbations is fundamental to understanding gene function, disease mechanisms, and therapeutic development. Recent deep learning approaches have shown promise in modeling single-cell perturbation responses, but they struggle to generalize across cell types and perturbation contexts because they have limited contextual information during generation. We introduce PT-RAG (Perturbation-aware Two-stage Retrieval-Augmented Generation), a framework that extends Retrieval-Augmented Generation beyond traditional language-model applications to cellular biology. Standard RAG systems are designed for text retrieval with pre-trained LLMs. Perturbation retrieval, by contrast, lacks established similarity metrics and requires learning what constitutes relevant context, which makes differentiable retrieval essential. PT-RAG addresses this with a two-stage pipeline: it first retrieves candidate perturbations $K$ using GenePT embeddings, then adaptively refines the selection through Gumbel-Softmax discrete sampling conditioned on both the cell state and the input perturbation. This cell-type-aware differentiable retrieval enables end-to-end optimization of the retrieval objective jointly with generation. On the Replogle-Nadig single-gene perturbation dataset, PT-RAG outperforms both STATE and vanilla RAG under identical experimental conditions, with the strongest gains in distributional similarity metrics ($W_1$, $W_2$). Notably, vanilla RAG's dramatic failure is itself a key finding: it demonstrates that differentiable, cell-type-aware retrieval is essential in this domain, and that naive retrieval can actively harm performance. Our results establish retrieval-augmented generation as a promising paradigm for modeling cellular responses to gene perturbation. The code to reproduce our experiments is available at https://github.com/difra100/PT-RAG_ICLR.
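
To make the two-stage idea concrete, the following is a minimal forward-pass sketch in plain Python, not the authors' implementation. It rests on several assumptions: random vectors stand in for GenePT embeddings and for the cell state, stage 1 is taken to be a cosine-similarity top-k lookup, the stage-2 scorer is a hypothetical dot product against the sum of cell state and perturbation embedding, and the selection is simplified to a single relaxed categorical sample over the candidates. The helper names (`top_k_candidates`, `gumbel_softmax`) are illustrative only.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_candidates(query, bank, k):
    """Stage 1 (assumed): indices of the k bank embeddings most similar to the query."""
    ranked = sorted(range(len(bank)), key=lambda i: cosine(query, bank[i]), reverse=True)
    return ranked[:k]


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
dim, n_bank, k = 8, 20, 5

# Stand-ins for real data: random vectors, NOT GenePT embeddings or measured cell states.
bank = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bank)]
query_pert = [rng.gauss(0, 1) for _ in range(dim)]
cell_state = [rng.gauss(0, 1) for _ in range(dim)]

# Stage 1: retrieve k candidate perturbations by embedding similarity.
candidates = top_k_candidates(query_pert, bank, k)

# Stage 2: score each candidate against cell state and perturbation together
# (hypothetical scorer), then draw a relaxed sample over the candidates.
joint = [c + p for c, p in zip(cell_state, query_pert)]
logits = [sum(a * b for a, b in zip(bank[i], joint)) for i in candidates]
weights = gumbel_softmax(logits, tau=0.5, rng=rng)

# Retrieved context: weighted combination of the candidate embeddings.
context = [sum(w * bank[i][d] for w, i in zip(weights, candidates)) for d in range(dim)]
```

In an autodiff framework the weights would remain differentiable with respect to the scorer's parameters, which is what would allow a retrieval step of this kind to be trained jointly with the generator. Lower values of `tau` push the weights closer to a hard one-hot choice.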
