Gene regulatory networks (GRNs) represent the causal relationships between transcription factors (TFs) and target genes in single-cell RNA sequencing (scRNA-seq) data. Understanding these networks is crucial for uncovering disease mechanisms and identifying therapeutic targets. In this work, we investigate the potential of large language models (LLMs) for GRN discovery, leveraging their learned biological knowledge either alone or in combination with traditional statistical methods. Because ground-truth causal graphs are unavailable, we develop a task-based evaluation strategy: we use the GRNs suggested by LLMs to guide causal synthetic data generation and compare the resulting data against the original dataset. Our statistical and biological assessments show that LLMs can support statistical modeling and data synthesis for biological research.