Protein language models (PLMs) face a fundamental divide: masked language models (MLMs) excel at fitness prediction while causal models enable generation, forcing practitioners to maintain separate architectures. We introduce \textbf{Proust}, a 309M-parameter causal PLM that bridges this gap through architectural innovations adapted from recent LLM research, including grouped-query attention with shared K/V projections, cross-layer value residuals, and depthwise causal convolutions. Trained on 33B tokens in 40 B200 GPU-hours, Proust achieves Spearman $\rho = 0.390$ on ProteinGym substitutions, competitive with MLMs requiring 50--200$\times$ the compute. On indels, Proust sets a new state of the art, outperforming models up to 20$\times$ larger. On EVEREST viral fitness benchmarks, it approaches structure-aware methods using sequence alone. These representations place Proust in a sweet spot: it matches MLM-quality fitness prediction while retaining the native generative capability that MLMs lack by design. Interpretability analysis reveals that per-position entropy variance partially predicts when retrieval augmentation helps and when it hurts. Such diagnostics should sharpen with scale and can inform capabilities such as test-time scaling. Code and weights are available at https://github.com/Furkan9015/proust-inference.
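Two of the abstract's architectural components can be illustrated concretely. The sketch below is a minimal PyTorch rendering, not Proust's actual implementation: the class names, kernel size, and head counts are assumptions made for illustration. It shows (a) a depthwise causal convolution, where left-only padding guarantees position $t$ never attends to later positions, and (b) grouped-query attention in which a single projection produces one tensor used as both K and V, shared across query-head groups.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalDepthwiseConv(nn.Module):
    """Depthwise 1-D convolution made causal via left-only padding."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.kernel_size = kernel_size
        # groups=dim -> one filter per channel (depthwise)
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) -> (batch, dim, seq) for Conv1d
        x = x.transpose(1, 2)
        # pad only on the left so output at t depends on inputs <= t
        x = F.pad(x, (self.kernel_size - 1, 0))
        return self.conv(x).transpose(1, 2)


class SharedKVGroupedAttention(nn.Module):
    """Grouped-query attention where one projection serves as both K and V."""

    def __init__(self, dim: int, n_heads: int = 8, n_kv_heads: int = 2):
        super().__init__()
        assert dim % n_heads == 0 and n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, dim, bias=False)
        # a single projection: its output is reused as both K and V
        self.kv_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        kv = self.kv_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # broadcast the few shared K/V heads across all query heads
        rep = self.n_heads // self.n_kv_heads
        k = v = kv.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(b, t, -1))
```

Sharing one K/V tensor halves the K/V projection parameters and cache relative to standard grouped-query attention, which is one way a 309M-parameter causal model stays cheap at inference; whether Proust ties K and V exactly this way is an assumption here.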