GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians

Recent advances in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these analyses often require extensive expertise and manual effort, which limits their scalability. Large Language Model (LLM)-based agents have shown promise in automating such tasks owing to their growing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, covering the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, organized as a full analysis pipeline that follows the standards of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene expression datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at \url{https://github.com/Liu-Hy/GenoTex}.
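
To make the three tasks named above concrete (dataset selection, preprocessing, and statistical analysis), the following is a minimal illustrative sketch on made-up toy data. It is not the GenoTEX pipeline or the GenoAgents implementation. The function names, the dictionary layout, the trait name, and the gene names are all hypothetical. Welch's t-statistic is used here only as a simple stand-in for the statistical analysis step and is not claimed to be the method used in the benchmark.

```python
import math
from statistics import mean, variance


def select_dataset(candidates, trait):
    """Pick the candidate that covers the trait and has the most samples."""
    usable = [c for c in candidates if trait in c["traits"]]
    if not usable:
        raise ValueError(f"no dataset covers trait {trait!r}")
    return max(usable, key=lambda c: len(c["labels"]))


def preprocess(expression):
    """Drop genes with missing values, then log2-transform the rest."""
    return {
        gene: [math.log2(v + 1.0) for v in values]
        for gene, values in expression.items()
        if all(v is not None for v in values)
    }


def welch_t(a, b):
    """Welch's t-statistic for two groups (stand-in analysis method)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes(expression, labels):
    """Rank genes by absolute t-statistic between cases (1) and controls (0)."""
    scores = {}
    for gene, values in expression.items():
        cases = [v for v, y in zip(values, labels) if y == 1]
        controls = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = welch_t(cases, controls)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


# Hypothetical toy datasets; all values are invented for illustration.
candidates = [
    {
        "name": "toy_A",
        "traits": {"asthma"},
        "labels": [1, 1, 0, 0],
        "expression": {"GENE1": [80, 95, 12, 9]},
    },
    {
        "name": "toy_B",
        "traits": {"asthma"},
        "labels": [1, 1, 1, 0, 0, 0],
        "expression": {
            "GENE1": [90, 110, 100, 10, 12, 11],
            "GENE2": [50, 52, 49, 51, 50, 48],
            "GENE3": [5, None, 7, 6, 5, 8],  # missing value, dropped later
        },
    },
]

chosen = select_dataset(candidates, "asthma")
clean = preprocess(chosen["expression"])
ranking = rank_genes(clean, chosen["labels"])
print(chosen["name"], ranking)
```

In this toy setup the larger dataset is selected, the gene with a missing value is filtered out, and the gene whose expression differs most between cases and controls is ranked first. A real analysis would also need steps this sketch omits, such as gene identifier mapping, batch and confounder handling, and multiple-testing correction.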
