Fitness functions map large combinatorial spaces of biological sequences to properties of interest. Inferring these multimodal functions from experimental data is a central task in modern protein engineering. Global epistasis models are an effective and physically grounded class of models for estimating fitness functions from observed data. These models assume that a sparse latent function is transformed by a monotonic nonlinearity to emit measurable fitness. Here we demonstrate that minimizing contrastive loss functions, such as the Bradley-Terry loss, is a simple and flexible technique for extracting the sparse latent function implied by global epistasis. We argue by way of a fitness-epistasis uncertainty principle that the nonlinearities in global epistasis models can produce observed fitness functions that do not admit sparse representations, and thus may be inefficient to learn from observations when using a Mean Squared Error (MSE) loss, which is common practice. We show that contrastive losses are able to accurately estimate a ranking function from limited data even in regimes where MSE is ineffective. We validate the practical utility of this insight by showing that contrastive loss functions yield consistently improved performance on benchmark tasks.
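As an illustration of the contrastive objective named in the abstract, the following is a minimal sketch of a pairwise Bradley-Terry loss, assuming NumPy. The function name `bradley_terry_loss`, the toy values, and the exponential nonlinearity are hypothetical choices made for this example and are not taken from the paper. The sketch shows one property the abstract relies on: because the loss only uses the ordering of the measured fitness values, it is unchanged when the observations pass through a strictly increasing nonlinearity, whereas an MSE target would change.

```python
import numpy as np


def bradley_terry_loss(scores, fitness):
    """Mean Bradley-Terry loss over all ordered pairs with fitness[i] > fitness[j].

    scores  : model outputs, one per sequence (hypothetical latent estimates)
    fitness : measured fitness values, one per sequence
    """
    scores = np.asarray(scores, dtype=float)
    fitness = np.asarray(fitness, dtype=float)
    # diff[i, j] = scores[i] - scores[j]
    diff = scores[:, None] - scores[None, :]
    # mask[i, j] is True when sequence i was measured as fitter than sequence j
    mask = fitness[:, None] > fitness[None, :]
    if not mask.any():
        return 0.0
    # -log(sigmoid(d)) = log(1 + exp(-d)), evaluated stably with logaddexp
    return float(np.logaddexp(0.0, -diff[mask]).mean())


# Toy data (illustrative only): a latent function and a monotonic transform of it.
latent = np.array([0.1, -0.3, 0.7, 0.2])
observed = np.exp(3.0 * latent)  # assumed nonlinearity, strictly increasing
scores = np.array([0.0, -1.0, 2.0, 0.5])

loss_on_latent = bradley_terry_loss(scores, latent)
loss_on_observed = bradley_terry_loss(scores, observed)
```

Here `loss_on_latent` and `loss_on_observed` coincide because both targets induce the same set of ordered pairs; in a training loop the same quantity would be minimized with respect to the parameters of the model that produces `scores`.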