Single-cell perturbation studies face two heterogeneity bottlenecks: (i) semantic heterogeneity, in which identical biological concepts are encoded under incompatible metadata schemas across datasets; and (ii) statistical heterogeneity, in which distribution shifts arising from biological variation demand dataset-specific inductive biases. We propose HarmonyCell, an end-to-end agent framework that addresses each challenge with a dedicated mechanism: an LLM-driven Semantic Unifier autonomously maps disparate metadata into a canonical interface without manual intervention, and an adaptive Monte Carlo Tree Search engine operates over a hierarchical action space to synthesize architectures whose statistical inductive biases suit the distribution shift at hand. Evaluated across diverse perturbation tasks under both semantic and distribution shifts, HarmonyCell achieves a 95% valid execution rate on heterogeneous input datasets (versus 0% for general-purpose agents) while matching or exceeding expert-designed baselines in rigorous out-of-distribution evaluations. This dual-track orchestration enables scalable, automatic virtual cell modeling without dataset-specific engineering.