Central Dogma Transformer II: An AI Microscope for Understanding Cellular Regulatory Mechanisms

From arXiv. 24 pages, 6 figures, 1 supplementary figure, 33 references. v2: added ENCODE enrichment analysis, feedback cycle discussion, and expanded references.

Current biological AI models lack interpretability -- their internal representations do not correspond to biological relationships that researchers can examine. Here we present CDT-II, an "AI microscope" whose attention maps are directly interpretable as regulatory structure. By mirroring the central dogma in its architecture, CDT-II ensures that each attention mechanism corresponds to a specific biological relationship: DNA self-attention for genomic relationships, RNA self-attention for gene co-regulation, and DNA-to-RNA cross-attention for transcriptional control. Using only genomic embeddings and raw per-cell expression, CDT-II enables experimental biologists to observe regulatory networks in their own data. Applied to K562 CRISPRi data, CDT-II predicts perturbation effects (per-gene mean $r = 0.84$) and recovers the GFI1B regulatory network without supervision (6.6-fold enrichment, $P = 3.5 \times 10^{-17}$). Systematic comparison against ENCODE K562 regulatory annotations reveals that cross-attention autonomously focuses on known regulatory elements -- DNase hypersensitive sites ($201\times$ enrichment), CTCF binding sites ($28\times$), and histone marks -- across all five held-out genes. Two distinct attention mechanisms independently identify an overlapping RNA processing module (80% gene overlap; RNA binding enrichment $P = 1 \times 10^{-16}$). CDT-II establishes mechanism-oriented AI as an alternative to task-oriented approaches, revealing regulatory structure rather than merely optimizing predictions.
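The three attention maps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the CDT-II implementation. It assumes plain scaled dot-product attention with no learned projections, random placeholder embeddings, and hypothetical token counts. It also assumes that RNA (gene) tokens act as queries over DNA tokens in the cross-attention step, a direction the abstract does not state explicitly.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (output, attention_map)."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ values, weights

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 genomic (DNA) tokens, 4 gene (RNA) tokens, width 8.
n_dna, n_rna, d_model = 6, 4, 8
dna = rng.normal(size=(n_dna, d_model))  # placeholder genomic embeddings
rna = rng.normal(size=(n_rna, d_model))  # placeholder expression embeddings

# DNA self-attention: relationships among genomic positions.
dna_out, dna_self_map = attention(dna, dna, dna)

# RNA self-attention: relationships among genes (co-regulation).
rna_out, rna_self_map = attention(rna, rna, rna)

# Cross-attention (assumed direction): gene queries over genomic keys/values.
cross_out, cross_map = attention(rna_out, dna_out, dna_out)

print(dna_self_map.shape, rna_self_map.shape, cross_map.shape)
```

In this sketch each row of `cross_map` is a probability distribution over genomic tokens for one gene. A matrix of this kind is what one would compare against annotations such as DNase hypersensitive sites or CTCF binding sites, under the assumptions stated above.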

