GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction

Characterizing the physiological life boundaries of microbial strains, including viable temperature, pH, salinity, substrate utilization, and morphology, is central to biotechnology and ecology, yet traditionally requires exhaustive in vitro screening. Existing computational approaches either treat physiological traits as isolated supervised targets or repurpose biological foundation models as static encoders, leaving the genotype-to-physiology gap largely unbridged. We formulate microbial life-boundary prediction as a unified genome-to-physiology task and address it with a genome-conditioned, tool-augmented large language model (LLM) agent. To support this task, we curate a strain-centric benchmark from IJSEM, NCBI, and BacDive covering 1,525 strains and 6,448 instances across viability intervals, environmental optima, substrate utilization, categorical traits, and morphology. Architecturally, the agent injects frozen LucaOne genome embeddings into a Qwen backbone via lightweight token fusion, and reasons over a similarity-based retrieval-augmented generation (RAG) module and a Genome-scale Metabolic Model (GEM) perturbation tool. We optimize the agent through a three-stage pipeline: gene-text alignment, agentic supervised fine-tuning (SFT) on distilled trajectories, and Group Relative Policy Optimization (GRPO) with a novel counterfactual gene-grounding reward that reinforces the policy only when the authentic genome embedding causally improves correct-token generation relative to a zero-gene ablation. The resulting 4B-parameter agent matches or surpasses substantially larger frontier LLMs, and ablations confirm that genome-token fusion, dynamic tool use, and the counterfactual reward each yield distinct, significant gains.
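
The counterfactual gene-grounding reward can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function names, the mean log-probability gain, the `margin` and `bonus` parameters, and the binary form of the reward are all hypothetical. The sketch grants a bonus only when the answer is correct and the correct tokens are more likely under the real genome embedding than under a zero-gene ablation, then applies the standard group-relative normalization used in GRPO.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def counterfactual_gene_reward(
    logp_with_genome: Sequence[float],
    logp_zero_gene: Sequence[float],
    is_correct: bool,
    margin: float = 0.0,  # hypothetical threshold on the log-prob gain
    bonus: float = 1.0,   # hypothetical reward magnitude
) -> float:
    """Return `bonus` only if the answer is correct AND the real genome
    embedding raises the mean log-probability of the correct tokens
    relative to the zero-gene ablation; otherwise return 0.0."""
    if len(logp_with_genome) != len(logp_zero_gene):
        raise ValueError("log-prob sequences must have the same length")
    if not is_correct or len(logp_with_genome) == 0:
        return 0.0
    # Mean per-token gain attributable to the authentic genome embedding.
    gain = mean(a - b for a, b in zip(logp_with_genome, logp_zero_gene))
    return bonus if gain > margin else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: two sampled responses for the same strain. Only the first is both
# correct and helped by the genome embedding, so only it earns the bonus.
r1 = counterfactual_gene_reward([-0.1, -0.2], [-0.5, -0.9], is_correct=True)
r2 = counterfactual_gene_reward([-0.5, -0.9], [-0.1, -0.2], is_correct=True)
advantages = group_relative_advantages([r1, r2])
```

In practice such a term would presumably be combined with task-accuracy and format rewards; how the paper weights these components is not stated in the abstract.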
