Language models have advanced sequence analysis, yet DNA foundation models often lag behind task-specific methods for unclear reasons. We present AntigenLM, a generative DNA language model pretrained on influenza genomes with intact, aligned functional units. This structure-aware pretraining enables AntigenLM to capture evolutionary constraints and generalize across tasks. Fine-tuned on time-series hemagglutinin (HA) and neuraminidase (NA) sequences, AntigenLM accurately forecasts future antigenic variants across regions and subtypes, including those unseen during training, outperforming phylogenetic and evolution-based models. It also achieves near-perfect subtype classification. Ablation studies show that disrupting genomic structure through fragmentation or shuffling severely degrades performance, revealing the importance of preserving functional-unit integrity in DNA language modeling. AntigenLM thus provides both a powerful framework for antigen evolution prediction and a general principle for building biologically grounded DNA foundation models.