HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to fit the language modeling framework, which inevitably loses fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that this limitation can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which places a continuous-valued diffusion head atop a discrete pLM, enabling it to operate on both discrete and continuous tokens for joint sequence-structure modeling. HD-Prot captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive empirical results show that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks, performing on par with state-of-the-art multimodal pLMs despite being developed under limited computational resources. These results highlight the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.
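To make the idea of a unified absorbing diffusion process over both modalities more concrete, the following is a minimal illustrative sketch of one corruption step, not the HD-Prot implementation. The abstract does not specify the noise schedule, whether the two modalities are masked independently, or how an absorbed structure latent is represented. The function name `absorb`, the constant `MASK_ID`, the use of `None` as the absorbed state of a latent, and the per-position independent masking at rate `t` are all assumptions made for illustration.

```python
import random

# Hypothetical id for the absorbing [MASK] state of a sequence token.
MASK_ID = -1


def absorb(seq_tokens, struct_latents, t, rng):
    """Corrupt paired sequence tokens and structure latents at noise level t.

    Each token, in each modality, is independently replaced by its
    absorbing state with probability t (an assumed, simplified schedule).
    Discrete tokens become MASK_ID; continuous latents become None, which
    stands in for whatever mask embedding a real model would use.
    Returns the corrupted inputs plus boolean masks marking the positions
    a denoiser would be trained to recover.
    """
    assert len(seq_tokens) == len(struct_latents)
    noisy_seq, noisy_struct = [], []
    seq_mask, struct_mask = [], []
    for aa, z in zip(seq_tokens, struct_latents):
        m_seq = rng.random() < t
        m_struct = rng.random() < t
        noisy_seq.append(MASK_ID if m_seq else aa)
        noisy_struct.append(None if m_struct else list(z))
        seq_mask.append(m_seq)
        struct_mask.append(m_struct)
    return noisy_seq, noisy_struct, seq_mask, struct_mask


if __name__ == "__main__":
    rng = random.Random(0)
    seq = [3, 7, 1, 4]  # toy amino-acid ids
    lat = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # toy 2-D latents
    print(absorb(seq, lat, 0.5, rng))
```

Under this reading, a shared backbone would see the partially masked inputs from both modalities, a categorical head would be trained on positions flagged in `seq_mask`, and a continuous diffusion head would be trained on positions flagged in `struct_mask`. The heads themselves are omitted here because the abstract gives no detail about them.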
