We present an AI Co-Scientist framework that closes the research loop for the production search-ranking system of a large online travel platform -- pairing LLM agents with direct cloud-compute access so that idea generation, code implementation, GPU experimentation, and result analysis iterate end-to-end with a human scientist in the loop. The framework uses a hybrid agent architecture: single-LLM agents handle routine work, while multi-LLM consensus (GPT-5.2, Gemini Pro 3, Claude Opus 4.5) is invoked for higher-stakes decisions. On the production ranking task, a human-designed transformer baseline (V2) yielded $+0.118\%$ over a pre-transformer baseline (V1); the AI Co-Scientist's automated loop on top of V2 contributed an additional $+0.083\%$, for a combined $+0.201\%$ offline gain delivered in roughly one extra week of wall-clock time (single-run numbers; statistical limits discussed in the paper). The most useful AI proposals -- unified long-sequence layouts, slot-type embeddings, and multi-phase learning-rate schedules -- are standard practice in NLP and computer vision but were absent from our production stack, suggesting that LLM agents can serve as cross-disciplinary connectors for ranking teams. We also report deployment context, negative results, and lessons learned.