In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, designed to operate under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability: real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. The router and the prompts are optimized via a router-prompt co-evolution strategy, which interleaves router learning and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, respectively, surpassing the SOTA methods GEPA and ACE by up to 37.2% and 48.2%, respectively.
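To make the interleaving of router and prompt learning concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes things the abstract does not specify: inputs are already embedded as numeric vectors, the router is a nearest-centroid classifier, each cluster picks its prompt from a fixed candidate list, and a hypothetical feedback function `score(prompt, x)` is available. All names (`route`, `prompt_phase`, `router_phase`, `co_evolve`) are illustrative.

```python
from collections import defaultdict
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]
ScoreFn = Callable[[str, Vector], float]  # hypothetical feedback signal


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def route(x: Vector, centroids: Sequence[Vector]) -> int:
    """Toy router: index of the nearest centroid (an assumption)."""
    return min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))


def prompt_phase(inputs, centroids, candidates, score: ScoreFn, prompts):
    """Router fixed: each cluster keeps the candidate prompt that scores
    best on the inputs currently routed to it."""
    groups = defaultdict(list)
    for x in inputs:
        groups[route(x, centroids)].append(x)
    new_prompts = list(prompts)
    for k, members in groups.items():
        new_prompts[k] = max(candidates,
                             key=lambda p: sum(score(p, x) for x in members))
    return new_prompts


def router_phase(inputs, centroids, prompts, score: ScoreFn):
    """Prompts fixed: relabel each input with the cluster whose prompt
    serves it best (ties go to the nearer centroid), then refit centroids."""
    groups = defaultdict(list)
    for x in inputs:
        best = max(range(len(centroids)),
                   key=lambda k: (score(prompts[k], x),
                                  -sq_dist(x, centroids[k])))
        groups[best].append(x)
    new_centroids = []
    for k, old in enumerate(centroids):
        members = groups.get(k)
        if members:  # mean of the members, coordinate by coordinate
            new_centroids.append(
                tuple(sum(col) / len(members) for col in zip(*members)))
        else:        # empty cluster: keep the old centroid
            new_centroids.append(old)
    return new_centroids


def co_evolve(inputs, centroids, candidates, score: ScoreFn, rounds: int = 3):
    """Alternate prompt and router phases for a fixed number of rounds."""
    prompts: List[str] = [candidates[0]] * len(centroids)
    for _ in range(rounds):
        prompts = prompt_phase(inputs, centroids, candidates, score, prompts)
        centroids = router_phase(inputs, centroids, prompts, score)
    return centroids, prompts


# Toy stream: "math-like" inputs near 0.0 and "code-like" inputs near 1.0.
stream = [(0.0,), (0.1,), (0.2,), (0.9,), (1.0,)]
toy_score = lambda p, x: 1.0 if (p == "math-prompt") == (x[0] < 0.5) else 0.0
centroids, prompts = co_evolve(stream, [(0.0,), (0.3,)],
                               ["math-prompt", "code-prompt"], toy_score)
print(prompts[route((0.2,), centroids)])
```

In this toy run the poorly placed initial centroids first send the input `(0.2,)` to the wrong cluster. The router phase then corrects the assignment using the prompts chosen in the preceding prompt phase, which illustrates the mutual dependency that the interleaved phases are meant to address.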