Fitness functions map large combinatorial spaces of biological sequences to properties of interest. Inferring these multimodal functions from experimental data is a central task in modern protein engineering. Global epistasis models are an effective and physically grounded class of models for estimating fitness functions from observed data. These models assume that a sparse latent function is transformed by a monotonic nonlinearity to emit measurable fitness. Here we demonstrate that minimizing contrastive loss functions, such as the Bradley-Terry loss, is a simple and flexible technique for extracting the sparse latent function implied by global epistasis. We argue, by way of a fitness-epistasis uncertainty principle, that the nonlinearities in global epistasis models can produce observed fitness functions that do not admit sparse representations, and that such functions may therefore be inefficient to learn from observations when using a Mean Squared Error (MSE) loss, as is common practice. We show that contrastive losses can accurately estimate a ranking function from limited data even in regimes where MSE is ineffective. We validate the practical utility of this insight by showing that contrastive loss functions yield consistently improved performance on benchmark tasks.
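To make the contrast with MSE concrete, here is a minimal sketch of a pairwise Bradley-Terry loss in NumPy. It assumes the standard pairwise form, in which each pair of sequences where one has higher measured fitness contributes the negative log-sigmoid of the model's score difference; the exact formulation used in the paper may differ. The function name `bradley_terry_loss` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def bradley_terry_loss(scores, fitness):
    """Mean pairwise Bradley-Terry loss (illustrative sketch).

    For every ordered pair (i, j) with fitness[i] > fitness[j], the loss is
    -log(sigmoid(scores[i] - scores[j])). Only the ordering of `fitness`
    is used, never its magnitude.
    """
    scores = np.asarray(scores, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    diff = scores[:, None] - scores[None, :]      # entry (i, j) is s_i - s_j
    wins = fitness[:, None] > fitness[None, :]    # True where i outranks j
    if not wins.any():
        return 0.0                                # no comparable pairs
    # -log(sigmoid(d)) = log(1 + exp(-d)), evaluated stably with logaddexp
    return float(np.mean(np.logaddexp(0.0, -diff[wins])))


# Toy example (assumed data): a latent function and a monotonic nonlinearity.
latent = np.array([0.0, 1.0, 2.0, 3.0])
observed = np.exp(3.0 * latent)                   # strictly increasing transform

loss_on_observed = bradley_terry_loss(latent, observed)
loss_on_latent = bradley_terry_loss(latent, latent)
loss_reversed = bradley_terry_loss(-latent, observed)
```

Because the loss depends on the measurements only through their ordering, `loss_on_observed` equals `loss_on_latent`: a strictly monotonic nonlinearity leaves the objective unchanged, whereas an MSE objective would have to fit the transformed values themselves. Scores that reverse the ranking (`loss_reversed`) incur a higher loss.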