Protein Thoughts: Interpretable Reasoning with Tree of Thoughts and Embedding-Space Flow Matching for Protein-Protein Interaction Discovery

Protein-protein interactions (PPIs) govern nearly all cellular processes, yet computational methods for identifying binding partners typically produce ranked predictions without mechanistic justification. This is a fundamental barrier to adoption, because biologists cannot assess whether a prediction reflects genuine biochemical insight or a spurious correlation. We present \textbf{Protein Thoughts}, a framework that reformulates PPI discovery as an interpretable search problem with explicit reasoning. The system decomposes binding evidence into four biologically meaningful signals: sequence similarity reflecting evolutionary relationships, structural complementarity capturing geometric fit, interface balance, and chemical compatibility encoding residue-level interactions. Rather than collapsing these signals into an opaque score, we preserve their individual contributions through a transparent value function that supports both ranking and auditing. To navigate large candidate spaces efficiently, we introduce hypothesis-guided, entropy-regularized Tree-of-Thoughts search. A fine-tuned language model generates search directives from embedding-derived features, classifying each candidate as high-priority, exploratory, or skippable. These directives condition a Boltzmann policy that balances exploitation with entropy-driven exploration, while hypothesis-aware pruning prevents premature abandonment of promising candidates. For candidates exhibiting score disagreement, hypothesis-conditioned embedding-space flow matching transports protein embeddings toward the binder manifold. On the SHS148k benchmark, Protein Thoughts achieves a mean best-binder rank of 11.2 versus 47.7 for an entropic tree search baseline, a 76% improvement. For binding prediction, the trained value function achieves $91.08 \pm 0.19$ Micro-F1, outperforming existing PPI methods on the same dataset.
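To make the two central ideas above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the transparent value function is a weighted sum over the four signals that also returns each signal's contribution, and that a directive enters the Boltzmann policy as a positive prior multiplier. The names `score_candidate`, `boltzmann_policy`, `entropy`, the signal keys, and all weights, temperatures, and priors are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical keys for the four evidence signals named in the abstract.
SIGNALS = ("sequence", "structure", "interface", "chemistry")


def score_candidate(signals, weights):
    """Return (total, contributions) so the score can be ranked and audited.

    Assumption: the value function is a weighted sum; the abstract does not
    specify its functional form.
    """
    contributions = {name: weights[name] * signals[name] for name in SIGNALS}
    return sum(contributions.values()), contributions


def boltzmann_policy(values, temperature=1.0, priors=None):
    """Softmax over candidate values, optionally reweighted by directive priors.

    Assumption: a directive (high-priority, exploratory, skippable) is mapped
    to a strictly positive multiplier; higher temperature flattens the
    distribution and so increases exploration.
    """
    if priors is None:
        priors = [1.0] * len(values)
    logits = [v / temperature + math.log(p) for v, p in zip(values, priors)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Under these assumptions, a caller can rank candidates by the total returned from `score_candidate` while still inspecting which signal drove the score, and can raise `temperature` in `boltzmann_policy` to trade exploitation for exploration.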
