Proteo-R1: Reasoning Foundation Models for De Novo Protein Design

Fang Wu, Weihao Xuan, Heli Qi, Hanqun Cao, Heng-Jui Chang, Zeqi Zhou, Haokai Zhao, Ma Jian, Carl Ma, Yu-Chi Cheng, Kuan Pang, Xiangru Tang, Zehong Wang, Guanlue Li, Hanchen Wang, Kejun Ying, Pan Lu, Chiho Im, Seungju Han, Peng Xia, Tinson Xu, Yinxi Li, Deyao Zhu, Pheng-Ann Heng, Naoto Yokoya, Masashi Sugiyama, Li Erran Li, Jure Leskovec, Yejin Choi

Deep learning for \emph{de novo} protein design has achieved atomic-level fidelity. However, existing models remain largely non-deliberative: they synthesize molecular geometries directly, without explicitly reasoning about which residues or interactions are functionally essential. As a result, design decisions are entangled with continuous sampling dynamics, which limits interpretability, controllability, and systematic reuse of biochemical knowledge. We introduce \textbf{Proteo-R1}, a reasoning-guided protein design framework that explicitly decouples \emph{molecular understanding} from \emph{geometric generation}. Proteo-R1 adopts a dual-expert architecture in which a multimodal large language model (MLLM) serves as an \emph{understanding expert}, analyzing protein sequences, structures, and textual context to identify the key functional residues that govern binding and specificity. These residue-level decisions are then passed as hard constraints to a separate diffusion-based \emph{generation expert}, which performs conditional co-design while respecting the fixed interaction anchors. This factorization mirrors how human experts approach molecular engineering: first reasoning about which interactions are critical, then optimizing geometry subject to those constraints. By operationalizing reasoning as explicit residue-level commitments rather than latent textual guidance, Proteo-R1 achieves stable, interpretable, and modular integration of LLM reasoning with state-of-the-art geometric generative models. Code, data, and demos are available at https://smiles724.github.io/r1/.
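
The two-stage factorization described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`ResidueConstraint`, `understanding_expert`, `generation_expert`, `co_design`) is hypothetical, a simple letter-matching rule stands in for the MLLM, and random resampling stands in for the diffusion model. The only point it shows is the interface, in which the first stage commits to explicit residue-level anchors and the second stage may change everything except those anchors.

```python
import random
from dataclasses import dataclass

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


@dataclass(frozen=True)
class ResidueConstraint:
    """A residue-level commitment (hypothetical type, for illustration)."""
    position: int  # 0-based index into the designed chain
    residue: str   # one-letter amino-acid code that must stay fixed


def understanding_expert(sequence, hotspot_letters="WYR"):
    """Toy stand-in for the MLLM understanding expert.

    Assumption: residues W, Y and R are treated as 'key' purely so the
    example has something to anchor; a real model would reason over
    sequence, structure and text.
    """
    return [
        ResidueConstraint(i, aa)
        for i, aa in enumerate(sequence)
        if aa in hotspot_letters
    ]


def generation_expert(sequence, constraints, steps=10, seed=0):
    """Toy stand-in for the diffusion-based generation expert.

    Each step resamples every free position uniformly at random, then
    re-imposes the anchors so they act as hard constraints.
    """
    rng = random.Random(seed)
    fixed = {c.position: c.residue for c in constraints}
    current = list(sequence)
    for _ in range(steps):
        for i in range(len(current)):
            if i not in fixed:
                current[i] = rng.choice(AMINO_ACIDS)
        for i, aa in fixed.items():  # anchors can never drift
            current[i] = aa
    return "".join(current)


def co_design(sequence, seed=0):
    """Run understanding first, then constrained generation."""
    constraints = understanding_expert(sequence)
    designed = generation_expert(sequence, constraints, seed=seed)
    return constraints, designed


if __name__ == "__main__":
    anchors, designed = co_design("MKWTAYRLG")
    print([(c.position, c.residue) for c in anchors])
    print(designed)
```

Because the anchors are plain data passed between two functions, either stage could in principle be swapped out without touching the other, which is the modularity the abstract emphasizes.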
