We study the problem of extrapolative controlled generation, i.e., generating sequences with attribute values beyond the range seen in training. This task is of significant importance in automated design, especially drug discovery, where the goal is to design novel proteins that are \textit{better} (e.g., more stable) than existing sequences. By definition, the target sequences and their attribute values lie outside the training distribution, which poses a challenge for existing methods that aim to generate the target sequence directly. In this work, we instead propose Iterative Controlled Extrapolation (ICE), which iteratively makes local edits to a sequence to enable extrapolation. We train the model on synthetically generated sequence pairs, each of which demonstrates a small improvement in the attribute value. Results on one natural language task (sentiment analysis) and two protein engineering tasks (ACE2 stability and AAV fitness) show that ICE considerably outperforms state-of-the-art approaches despite its simplicity. Our code and models are available at: https://github.com/vishakhpk/iter-extrapolation.
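To make the idea of iterative local editing concrete, the following is a minimal toy sketch and not the released implementation. Everything in it is an assumption made for illustration: the attribute `score` (a count of the symbol `A`), the random single-position `local_edit` that stands in for a trained editing model, the four-letter alphabet, and the helper names `make_training_pairs` and `iterative_extrapolate`. It shows only the two ingredients named in the abstract: building sequence pairs with a small attribute improvement, and repeatedly applying local edits to move the attribute upward.

```python
import random
from typing import Callable, List, Tuple


def score(seq: str) -> int:
    """Toy attribute (hypothetical): the number of 'A' symbols in the sequence."""
    return seq.count("A")


def local_edit(seq: str, rng: random.Random, alphabet: str = "ACDE") -> str:
    """Stand-in for a learned editor: overwrite one random position."""
    if not seq:
        return seq
    i = rng.randrange(len(seq))
    return seq[:i] + rng.choice(alphabet) + seq[i + 1:]


def make_training_pairs(
    seqs: List[str], rng: random.Random, tries: int = 20, max_gain: int = 1
) -> List[Tuple[str, str]]:
    """Collect (source, target) pairs whose attribute gain is positive but small."""
    pairs = []
    for s in seqs:
        for _ in range(tries):
            t = local_edit(s, rng)
            if 0 < score(t) - score(s) <= max_gain:
                pairs.append((s, t))
                break  # keep at most one pair per source sequence
    return pairs


def iterative_extrapolate(
    seq: str,
    edit_fn: Callable[[str, random.Random], str],
    scorer: Callable[[str], int],
    steps: int,
    rng: random.Random,
    candidates: int = 8,
) -> str:
    """Run `steps` rounds of local edits, accepting a round only if the score improves."""
    current = seq
    for _ in range(steps):
        proposals = [edit_fn(current, rng) for _ in range(candidates)]
        best = max(proposals, key=scorer)
        if scorer(best) > scorer(current):
            current = best
    return current


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_training_pairs(["CDEC", "DDEE"], rng))
    print(iterative_extrapolate("CDECDE", local_edit, score, steps=10, rng=rng))
```

In this sketch the editor is random and the scorer is exact, so the loop reduces to simple hill climbing; with a trained editor and a learned scorer, the same loop structure would apply, but the behavior would depend on how well each generalizes beyond the training range.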