Protein sequence generation via stochastic attention produces plausible family members from small alignments without training, but treats all stored sequences equally and cannot direct generation toward a functional subset of interest. We show that a single scalar parameter, added as a bias to the sampler's attention logits, continuously shifts generation from the full family toward a user-specified subset, with no retraining and no change to the model architecture. A practitioner supplies a small set of sequences (for example, hits from a binding screen) and a multiplicity ratio that controls how strongly generation favors them. The method is agnostic to what the subset represents: binding, stability, specificity, or any other property. We find that the conditioning is exact at the level of the sampler's internal representation, but that the decoded sequence phenotype can fall short because the dimensionality reduction used to encode sequences does not always preserve the residue-level variation that defines the functional split. We term this discrepancy the calibration gap and show that it is predicted by a simple geometric measure of how well the encoding separates the functional subset from the rest of the family. Experiments on five Pfam families (Kunitz, SH3, WW, Homeobox, and Forkhead domains) confirm the monotonic relationship between separation and gap across a fourfold range of geometries. Applied to omega-conotoxin peptides targeting a calcium channel involved in pain signaling, curated seeding from 23 characterized binders produces over a thousand candidates that preserve the primary pharmacophore and all experimentally identified binding determinants. These results show that stochastic attention enables practitioners to expand a handful of experimentally characterized sequences into diverse candidate libraries without retraining a generative model.
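The biasing step described above can be illustrated with a minimal sketch. The abstract does not give the exact form of the bias, so this sketch assumes the scalar enters as the logarithm of the multiplicity ratio, added only to the logits of the user-specified subset. Under that assumption, an integer ratio is equivalent to storing each subset sequence that many times. The function name `biased_attention_weights`, the inverse temperature `beta`, and the random toy encodings are hypothetical, and only the attention weights are computed, not the full stochastic sampler.

```python
import numpy as np

def biased_attention_weights(query, memory, subset_mask, rho, beta=1.0):
    """Softmax attention over stored patterns with a log-multiplicity bias.

    query:       (d,) current sampler state in the encoded space
    memory:      (n, d) encoded family sequences
    subset_mask: (n,) boolean, True for the functional subset
    rho:         multiplicity ratio (> 0); rho = 1 leaves attention unbiased
    beta:        inverse temperature (an assumption of this sketch)
    """
    logits = beta * (memory @ query)
    # Assumed form of the bias: log(rho) on subset entries, 0 elsewhere.
    bias = np.where(subset_mask, np.log(rho), 0.0)
    logits = logits + bias
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Toy data: 6 stored patterns in 4 dimensions, the first 2 form the subset.
rng = np.random.default_rng(0)
memory = rng.normal(size=(6, 4))
query = rng.normal(size=4)
subset_mask = np.array([True, True, False, False, False, False])

for rho in (1.0, 3.0, 10.0):
    w = biased_attention_weights(query, memory, subset_mask, rho)
    print(f"rho={rho:>4}: attention mass on subset = {w[subset_mask].sum():.3f}")
```

In this sketch the attention mass on the subset grows monotonically with `rho`, which mirrors the continuous shift from the full family toward the subset. Whether the decoded sequences follow that shift depends on the encoding, which is the calibration gap discussed above.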