Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but it remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose sensitivity. Protein language models (PLMs) provide context-aware residue representations that could improve sensitivity in this regime. However, prior embedding-based protein retrieval pipelines often pool these representations into a single vector, potentially obscuring the local motifs, domains, or conserved residues that reveal remote homology. We introduce ProtoCol, a model that represents each protein as a set of residue embeddings and uses ColBERT-style late interaction to test whether residue-level comparison improves homolog retrieval. ProtoCol encodes proteins independently, so candidate representations can be pre-computed, and scores candidates with MaxSim over residue embeddings. On SCOPe superfamily and Pfam clan benchmarks, ProtoCol outperforms sequence-composition, alignment-based, pooled PLM, and trained single-vector baselines, supporting late interaction as an effective retrieval layer for remote homology search.
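To make the scoring step concrete, the following is a minimal sketch of MaxSim late interaction as it is usually defined for ColBERT: each query residue is matched to its most similar candidate residue, and the per-residue maxima are summed. It is an illustration under stated assumptions, not ProtoCol's implementation. The function names (`l2_normalize`, `maxsim_score`, `rank_candidates`), the use of cosine similarity, the sum aggregation, and the toy two-dimensional embeddings are all hypothetical choices made for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def l2_normalize(vec: Vector) -> List[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]
    return [x / norm for x in vec]


def maxsim_score(query: Sequence[Vector], candidate: Sequence[Vector]) -> float:
    """ColBERT-style MaxSim (assumed form): for each query residue, take the
    best cosine match among candidate residues, then sum over query residues."""
    if not query or not candidate:
        return 0.0
    q = [l2_normalize(v) for v in query]
    c = [l2_normalize(v) for v in candidate]
    total = 0.0
    for q_vec in q:
        total += max(sum(a * b for a, b in zip(q_vec, c_vec)) for c_vec in c)
    return total


def rank_candidates(
    query: Sequence[Vector], index: Dict[str, Sequence[Vector]]
) -> List[Tuple[str, float]]:
    """Score every pre-computed candidate against the query, best first."""
    scores = [(name, maxsim_score(query, emb)) for name, emb in index.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy residue embeddings (hypothetical values, 2 dimensions per residue).
query = [[1.0, 0.0], [0.0, 1.0]]
index = {
    "a": [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]],  # has a close match for each query residue
    "b": [[1.0, 1.0]],                          # one residue, partial match to both
}
ranking = rank_candidates(query, index)
print(ranking)
```

Because the candidate embeddings in `index` do not depend on the query, they can be computed once and stored, which is the property the abstract refers to as keeping candidate representations pre-computable. In this toy case candidate `a` ranks first because each query residue finds an exact match, whereas a pooled single vector for `a` would blur those two residues together.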