Investigating Data Pruning for Pretraining Biological Foundation Models at Scale

Biological foundation models (BioFMs), pretrained on large-scale biological sequences, have recently shown strong potential for providing meaningful representations for diverse downstream bioinformatics tasks. However, such models often rely on millions to billions of training sequences and billions of parameters, resulting in prohibitive computational costs and significant barriers to reproducibility and accessibility, particularly for academic labs. To address these challenges, we investigate the feasibility of data pruning for BioFM pretraining and propose a post-hoc influence-guided data pruning framework tailored to biological domains. Our approach introduces a subset-based self-influence formulation that estimates sample importance efficiently and at low computational cost, and builds two simple yet effective selection strategies on top of it: Top-k Influence (Top I) and Coverage-Centric Influence (CCI). We empirically validate our method on two representative BioFMs, RNA-FM and ESM-C. For RNA, our framework consistently outperforms random selection baselines under an extreme pruning rate of over 99%, demonstrating its effectiveness. Furthermore, we show that the framework generalizes to protein-related tasks using ESM-C. In particular, in both the RNA and protein settings, our coreset even outperforms random subsets that are ten times larger, revealing substantial redundancy in biological sequence datasets. These findings underscore the potential of influence-guided data pruning to substantially reduce the computational cost of BioFM pretraining, paving the way for more efficient, accessible, and sustainable biological AI research.
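
To make the two selection strategies concrete, the following is a minimal sketch, not the authors' implementation. It assumes that per-sample self-influence scores have already been computed (the abstract does not specify the formulation, so `scores` is simply a list of floats). The function names `top_k_influence` and `coverage_centric`, the equal-width binning, and the smallest-bin-first budget allocation are all illustrative assumptions about how a top-k rule and a coverage-oriented rule could be realized.

```python
import random


def top_k_influence(scores, k):
    """Return the indices of the k samples with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def coverage_centric(scores, k, num_bins=10, seed=0):
    """Stratified selection over the score range (an assumed scheme).

    Samples are grouped into equal-width score bins, and the budget k is
    spread across the non-empty bins, visiting the smallest bins first so
    that any unused budget rolls over to larger bins.
    """
    rng = random.Random(seed)
    lo, hi = min(scores), max(scores)
    # Fall back to a unit width when all scores are identical.
    width = (hi - lo) / num_bins or 1.0

    bins = [[] for _ in range(num_bins)]
    for i, s in enumerate(scores):
        b = min(int((s - lo) / width), num_bins - 1)
        bins[b].append(i)

    non_empty = sorted((b for b in bins if b), key=len)
    selected = []
    remaining = k
    for j, members in enumerate(non_empty):
        # Even share of the remaining budget among the bins not yet visited.
        share = remaining // (len(non_empty) - j)
        take = min(share, len(members))
        selected.extend(rng.sample(members, take))
        remaining -= take
    return sorted(selected)


if __name__ == "__main__":
    # Hypothetical scores for six sequences.
    scores = [0.1, 0.9, 0.5, 0.3, 0.7, 0.2]
    print("Top-k:", top_k_influence(scores, 2))
    print("Coverage-centric:", coverage_centric(scores, 3, num_bins=3))
```

Under these assumptions, the top-k rule concentrates the budget on the highest-scoring sequences, whereas the coverage-oriented rule draws from every part of the score distribution, which is one plausible way to preserve diversity at extreme pruning rates.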
