High-content perturbation experiments allow scientists to probe biomolecular systems at unprecedented resolution, but experimental and analysis costs pose significant barriers to widespread adoption. Machine learning has the potential to guide efficient exploration of the perturbation space and extract novel insights from these data. However, current approaches neglect the semantic richness of the relevant biology, and their objectives are misaligned with downstream biological analyses. In this paper, we hypothesize that large language models (LLMs) present a natural medium for representing complex biological relationships and rationalizing experimental outcomes. We propose PerturbQA, a benchmark for structured reasoning over perturbation experiments. Unlike current benchmarks that primarily interrogate existing knowledge, PerturbQA is inspired by open problems in perturbation modeling: prediction of differential expression and direction of change for unseen perturbations, and gene set enrichment. We evaluate state-of-the-art machine learning and statistical approaches for modeling perturbations, as well as standard LLM reasoning strategies, and we find that current methods perform poorly on PerturbQA. As a proof of feasibility, we introduce Summer (SUMMarize, retrievE, and answeR), a simple, domain-informed LLM framework that matches or exceeds the current state-of-the-art. Our code and data are publicly available at https://github.com/genentech/PerturbQA.