Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering on tabular genotype data, using a novel knowledge-driven framework. We develop FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), designed with chain-of-thought and ensembling principles, to select and engineer features using the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: https://github.com/PennShenLab/FREEFORM.