Exploring the limits of pre-trained embeddings in machine-guided protein design: a case study on predicting AAV vector viability

Effective representations of protein sequences are widely recognized as a cornerstone of machine learning-based protein design. Yet protein bioengineering poses unique challenges for sequence representation, because experimental datasets typically feature few mutations, which are either sparsely distributed across the entire sequence or densely concentrated within localized regions. This mutation pattern limits the ability of sequence-level representations to extract functionally meaningful signals. In addition, comprehensive comparative studies remain scarce, despite their crucial role in clarifying which representations best encode relevant information and ultimately support superior predictive performance. In this study, we systematically evaluate multiple ProtBERT and ESM2 embedding variants as sequence representations, using the adeno-associated virus (AAV) capsid as a case study and prototypical example of bioengineering, where functional optimization is targeted through highly localized sequence variation within an otherwise large protein. Our results reveal that, prior to fine-tuning, amino acid-level embeddings outperform sequence-level representations in supervised predictive tasks, whereas the latter tend to be more effective in unsupervised settings. However, optimal performance is achieved only when embeddings are fine-tuned with task-specific labels, in which case sequence-level representations perform best. Moreover, our findings indicate that the extent of sequence variation required to produce notable shifts in sequence representations exceeds what is typically explored in bioengineering studies, underscoring the need for fine-tuning in datasets characterized by sparse or highly localized mutations.
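One plausible intuition for why localized mutations barely move a sequence-level representation is that pooling over all residues dilutes a local change. The toy sketch below illustrates this with random vectors standing in for per-residue embeddings; it is not the ProtBERT or ESM2 pipeline evaluated in the study. The sequence length (735, roughly that of an AAV2 VP1 capsid protein), the embedding dimension, the mutated window, and all function names are assumptions chosen for illustration. Real language-model embeddings are contextual, so a mutation also perturbs the embeddings of other positions, an effect this sketch ignores.

```python
import numpy as np


def mean_pool(residue_embeddings):
    """Collapse a (length, dim) matrix of per-residue vectors into one vector."""
    return residue_embeddings.mean(axis=0)


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def simulate_local_mutations(seq_len=735, dim=128, window=(560, 588), seed=0):
    """Compare residue-level and mean-pooled similarity after a local change.

    All defaults are hypothetical: random Gaussian vectors replace real
    embeddings, and `window` marks an arbitrary mutated region.
    """
    rng = np.random.default_rng(seed)
    wild_type = rng.normal(size=(seq_len, dim))

    # Re-draw the vectors inside the window to mimic a fully mutated region.
    start, stop = window
    variant = wild_type.copy()
    variant[start:stop] = rng.normal(size=(stop - start, dim))

    # Average per-position similarity inside the mutated window.
    residue_level = float(np.mean([
        cosine_similarity(w, v)
        for w, v in zip(wild_type[start:stop], variant[start:stop])
    ]))

    # Similarity of the two mean-pooled, sequence-level vectors.
    sequence_level = cosine_similarity(mean_pool(wild_type), mean_pool(variant))
    return residue_level, sequence_level


if __name__ == "__main__":
    residue_level, sequence_level = simulate_local_mutations()
    print(f"residue-level similarity in window: {residue_level:.3f}")
    print(f"sequence-level (mean-pooled) similarity: {sequence_level:.3f}")
```

In this toy setting, the vectors inside the window become essentially uncorrelated, while the mean-pooled vectors stay nearly identical because only 28 of 735 positions changed. This is consistent with the observation that sequence-level representations shift little under localized variation, although the simulation does not reproduce the study's measurements.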
