Robust Machine Learning for Regulatory Sequence Modeling under Biological and Technical Distribution Shifts

We study robust machine learning for regulatory genomics under biologically and technically induced distribution shifts. Deep convolutional and attention-based models achieve strong in-distribution performance on DNA regulatory sequence prediction tasks, but they are usually evaluated under i.i.d. assumptions, even though real applications involve cell-type-specific programs, evolutionary turnover, assay protocol changes, and sequencing artifacts. We introduce a robustness framework that combines a mechanistic simulation benchmark with real-data analysis of a massively parallel reporter assay (MPRA) dataset to quantify performance degradation, calibration failures, and uncertainty-based reliability. In simulation, we generate motif-driven regulatory outputs with cell-type-specific programs, position weight matrix (PWM) perturbations, GC bias, depth variation, batch effects, and heteroscedastic noise, and we evaluate CNN, BiLSTM, and transformer models. Models remain accurate and reasonably calibrated under mild GC-content shifts but show higher error, severe variance miscalibration, and coverage collapse under motif-effect rewiring and in noise-dominated regimes, revealing robustness gaps that are invisible to standard i.i.d. evaluation. Adding simple biological structural priors (motif-derived features in simulation and global GC content in MPRA) improves in-distribution error and yields consistent robustness gains under biologically meaningful genomic shifts, while providing only limited protection against strong assay noise. Uncertainty-aware selective prediction offers an additional safety layer: risk-coverage analyses on simulated and MPRA data show that filtering low-confidence inputs recovers low-risk subsets, including under GC-based out-of-distribution conditions, although reliability gains diminish when noise dominates.
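As an illustration of the risk-coverage analysis and interval coverage mentioned above, the following is a minimal sketch rather than the evaluation code used in this work. It assumes a regression model that outputs a predictive mean and variance per sequence, uses predicted variance as the confidence score and mean squared error as the risk, and assumes Gaussian prediction intervals. The function names `risk_coverage_curve` and `interval_coverage` are hypothetical.

```python
import numpy as np


def risk_coverage_curve(y_true, y_pred, y_var):
    """Return (coverage, risk) arrays for selective regression.

    Assumption: lower predicted variance means higher confidence.
    Risk at coverage k/n is the mean squared error over the k most
    confident inputs.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_var = np.asarray(y_var, dtype=float)

    # Sort inputs from most to least confident.
    order = np.argsort(y_var, kind="stable")
    sq_err = (y_true[order] - y_pred[order]) ** 2

    k = np.arange(1, sq_err.size + 1)
    coverage = k / sq_err.size
    risk = np.cumsum(sq_err) / k  # running mean of squared error
    return coverage, risk


def interval_coverage(y_true, y_pred, y_var, z=1.96):
    """Fraction of targets inside the interval y_pred +/- z * std.

    Assumption: Gaussian predictive distribution; z=1.96 corresponds
    to a nominal 95% interval.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    half_width = z * np.sqrt(np.asarray(y_var, dtype=float))
    return float(np.mean(np.abs(y_true - y_pred) <= half_width))
```

Under these assumptions, a risk curve that rises with coverage indicates that the predicted variance ranks errors usefully, and an empirical interval coverage far below the nominal level corresponds to the coverage collapse described above.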
