PPI2Text: Captioning Protein-Protein Interactions with Coordinate-Aligned Pair-Map Decoding

Protein-protein interaction (PPI) modeling has been widely studied as a binary or multi-label classification task. Although emerging multimodal large language models (LLMs) can now describe single proteins, they cannot yet generate free-form descriptions of interactions between protein pairs. Moving beyond controlled-vocabulary annotations, we propose modeling PPIs with free-text descriptions, which enables richer expressiveness, improved interpretability, and better integration with literature knowledge bases. We present PPI2Text, a multimodal LLM for free-form PPI captioning from amino acid sequences. It encodes each protein with an ESM3 encoder, constructs a pair map from the two representations to capture interactions across all residue pairs, and autoregressively generates descriptions with a Qwen3 language decoder. We further introduce PaCo-RoPE, a coordinate-aligned positional encoding that aligns each axis of the pair grid with the residue positions of the corresponding protein. In addition, we release PPI2Text-Dataset, a 351k-pair corpus of free-form PPI descriptions aggregated from ten curated biological databases and further synthesized with Gemini under evidence-tiered prompting. PPI2Text consistently outperforms strong baselines across multiple ablation settings and evaluation protocols. It achieves higher scores on linguistic metrics against synthesized references, and it also excels on factuality metrics, for which an LLM-based judge evaluates outputs against raw biological evidence.
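
The abstract does not give the exact formulation of the pair map or of PaCo-RoPE, so the following is only a minimal illustrative sketch of one plausible reading, not the authors' implementation. It assumes that the pair map is built by concatenating the embedding of residue i of protein A with that of residue j of protein B, and that the coordinate alignment is realized by rotating one half of the channels with a standard rotary angle driven by i and the other half with an angle driven by j. Random arrays stand in for ESM3 embeddings, and all function names (`pair_map`, `rope_angles`, `rotate`, `axis_aligned_rope`) are hypothetical.

```python
import numpy as np

def pair_map(h_a, h_b):
    """Assumed pair map: cell (i, j) holds [h_a[i], h_b[j]] concatenated."""
    la, lb = h_a.shape[0], h_b.shape[0]
    a = np.broadcast_to(h_a[:, None, :], (la, lb, h_a.shape[1]))
    b = np.broadcast_to(h_b[None, :, :], (la, lb, h_b.shape[1]))
    return np.concatenate([a, b], axis=-1)  # shape (la, lb, da + db)

def rope_angles(positions, dim, base=10000.0):
    """Standard rotary angles, shape (len(positions), dim // 2)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions, inv_freq)

def rotate(x, angles):
    """Rotate each consecutive channel pair (x[2k], x[2k+1]) by angles[..., k]."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def axis_aligned_rope(grid):
    """Assumed coordinate alignment: the first half of the channels is rotated
    by the row index i (residue position in A), the second half by the column
    index j (residue position in B). Channel count must be divisible by 4."""
    la, lb, c = grid.shape
    half = c // 2
    ang_i = rope_angles(np.arange(la), half)[:, None, :]  # varies along rows only
    ang_j = rope_angles(np.arange(lb), half)[None, :, :]  # varies along columns only
    out = np.empty_like(grid)
    out[..., :half] = rotate(grid[..., :half], ang_i)
    out[..., half:] = rotate(grid[..., half:], ang_j)
    return out

# Toy stand-ins for per-residue encoder outputs (5 and 7 residues, width 8).
rng = np.random.default_rng(0)
h_a = rng.normal(size=(5, 8))
h_b = rng.normal(size=(7, 8))

grid = pair_map(h_a, h_b)
encoded = axis_aligned_rope(grid)
```

In this sketch each grid cell carries both residue positions independently, so a token at (i, j) is not flattened into a single sequence index. A real system would apply such rotations to attention queries and keys rather than directly to features; the direct application here is only to keep the example short.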
