Modeling genomic sequences is challenging because of their length and structural complexity. Traditional sequence models struggle to capture the long-range dependencies and biological features inherent in DNA. In this work, we propose TrinityDNA, a novel DNA foundation model designed to address these challenges. The model integrates two biologically informed components: Groove Fusion, which captures DNA's structural features, and Gated Reverse Complement (GRC), which handles the inherent reverse-complement symmetry of DNA sequences. We also introduce a multi-scale attention mechanism that lets the model attend to sequence dependencies at varying scales, and an evolutionary training strategy that progressively adapts the model to both prokaryotic and eukaryotic genomes. TrinityDNA provides a more accurate and efficient approach to genomic sequence modeling, with significant improvements in gene function prediction, regulatory mechanism discovery, and other genomics applications. By bridging machine learning techniques and biological insights, it paves the way for more effective analysis of genomic data. Finally, we introduce a new long-sequence DNA benchmark for coding sequence (CDS) annotation, making evaluation more comprehensive and more oriented toward practical applications.
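To make the reverse-complement symmetry concrete, here is a minimal Python sketch. It is not TrinityDNA's actual GRC module, whose design is not specified here. The function names `reverse_complement` and `gated_combine`, the scalar `gate_logit`, and the sigmoid-weighted blend are all illustrative assumptions. The only biological fact relied on is that a double-stranded DNA sequence can be read equivalently from either strand, so a sequence and its reverse complement carry the same information.

```python
import math

# Watson-Crick base pairing; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (the opposite strand, read 5'->3')."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def gated_combine(forward_feats, revcomp_feats, gate_logit):
    """Hypothetical gate: blend per-position features from the two strands.

    ASSUMPTION: a single scalar gate passed through a sigmoid. A real model
    would learn the gate and would likely use a tensor-valued one.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * f + (1.0 - gate) * r
            for f, r in zip(forward_feats, revcomp_feats)]


if __name__ == "__main__":
    seq = "ATGC"
    print(reverse_complement(seq))
    # Applying the operation twice returns the original sequence.
    print(reverse_complement(reverse_complement(seq)) == seq)
```

With `gate_logit = 0.0` the gate is 0.5 and the blend reduces to a plain average of the two strands' features. A learned gate would let a model weight one strand more heavily where strand orientation matters, as it does for coding sequences.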