Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data, using a novel knowledge-driven framework. We develop FREEFORM, Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling, designed with chain-of-thought and ensembling principles, to select and engineer features using the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, we find that this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.
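The abstract does not spell out how the ensembling step combines candidate features, so the following is only a minimal sketch of one common reading: several independent selection runs each propose a feature list, and the features named most often are kept. The function name `ensemble_select`, the vote-counting rule, and the variant IDs are all assumptions for illustration and are not taken from the FREEFORM code.

```python
from collections import Counter


def ensemble_select(candidate_lists, k):
    """Return the k feature names proposed most often across candidate lists.

    Hypothetical helper: each inner list stands in for the features proposed
    by one independent selection run (for example, one LLM query).
    """
    votes = Counter()
    for features in candidate_lists:
        # Deduplicate so each run contributes at most one vote per feature.
        votes.update(set(features))
    # Sort by descending vote count, then by name so ties break deterministically.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Placeholder variant IDs, not real results from the paper.
runs = [
    ["rs1", "rs2", "rs3"],
    ["rs2", "rs3", "rs4"],
    ["rs3", "rs2", "rs5"],
]
selected = ensemble_select(runs, k=2)
print(selected)
```

Voting of this kind tends to reduce the variance of any single run, which is one plausible reason to pair it with chain-of-thought prompting; the actual aggregation used by the framework may differ.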

