Protein inverse folding is a fundamental problem in bioinformatics: the goal is to recover an amino acid sequence from a given protein backbone structure. Existing methods have been successful, but they struggle to fully capture the intricate inter-residue relationships that are critical for accurate sequence prediction. We propose diffusion models with representation alignment (DMRA), a method that enhances diffusion-based inverse folding in two ways: (1) it introduces a shared center that aggregates contextual information from the entire protein structure and selectively distributes it to each residue; and (2) it aligns noisy hidden representations with clean semantic representations during the denoising process. This alignment is achieved through predefined semantic representations for amino acid types and a representation alignment method that uses type embeddings as semantic feedback to normalize each residue. In extensive evaluations on the CATH4.2 dataset, DMRA outperforms leading methods and achieves state-of-the-art performance, and it generalizes well to the TS50 and TS500 datasets.
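The two components named above can be illustrated with a minimal sketch. The abstract gives no equations, so everything here is an assumption made for illustration rather than the DMRA implementation: mean pooling stands in for the aggregation step, a sigmoid of a dot product stands in for a learned gate that distributes the center selectively, and a cosine-distance loss stands in for the representation alignment objective. All function names (`mean_pool`, `distribute`, `alignment_loss`) are hypothetical.

```python
import math
import random


def mean_pool(residues):
    """Aggregate all residue vectors into one shared-center vector.
    Assumption: plain mean pooling; the paper may aggregate differently."""
    n, d = len(residues), len(residues[0])
    return [sum(r[j] for r in residues) / n for j in range(d)]


def sigmoid(x):
    # Numerically stable form of 1 / (1 + exp(-x)).
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def distribute(residues, center):
    """Add the shared center to each residue, scaled by a per-residue gate.
    Assumption: gate = sigmoid(residue . center), a stand-in for a learned gate."""
    updated = []
    for r in residues:
        gate = sigmoid(sum(a * b for a, b in zip(r, center)))
        updated.append([a + gate * c for a, c in zip(r, center)])
    return updated


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def alignment_loss(hidden, type_ids, type_embeddings):
    """Mean (1 - cosine similarity) between each noisy hidden vector and the
    clean embedding of its amino acid type.
    Assumption: cosine distance; the paper's alignment objective is not given."""
    losses = [1.0 - cosine(h, type_embeddings[t]) for h, t in zip(hidden, type_ids)]
    return sum(losses) / len(losses)


# Toy data: 20 amino acid types, 5 residues, 8-dimensional features.
rng = random.Random(0)
type_embeddings = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]
type_ids = [3, 7, 7, 0, 19]
noisy_hidden = [[x + rng.gauss(0, 0.5) for x in type_embeddings[t]] for t in type_ids]

center = mean_pool(noisy_hidden)
contextual = distribute(noisy_hidden, center)
loss = alignment_loss(contextual, type_ids, type_embeddings)
```

In a trained model, a loss of this kind would be added to the denoising objective so that intermediate representations are pulled toward the fixed type embeddings; how DMRA weights or schedules that term is not stated in the abstract.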