The advent of single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) offers an innovative perspective for deciphering regulatory mechanisms by assembling a vast repository of single-cell chromatin accessibility data. While foundation models have achieved significant success in single-cell transcriptomics, there is currently no foundation model for scATAC-seq that supports zero-shot high-quality cell identification and comprehensive multi-omics analysis simultaneously. Key challenges lie in the high dimensionality and sparsity of scATAC-seq data, as well as the lack of a standardized schema for representing open chromatin regions (OCRs). Here, we present ChromFound, a foundation model tailored for scATAC-seq. ChromFound utilizes a hybrid architecture and genome-aware tokenization to effectively capture genome-wide long contexts and regulatory signals from dynamic chromatin landscapes. Pretrained on 1.97 million cells from 30 tissues and 6 disease conditions, ChromFound demonstrates broad applicability across 6 diverse tasks. Notably, it achieves robust zero-shot performance in generating universal cell representations and exhibits excellent transferability in cell type annotation and cross-omics prediction. By uncovering enhancer-gene links undetected by existing computational methods, ChromFound offers a promising framework for understanding disease risk variants in the noncoding genome.