Prompt-based offline methods are commonly used to optimize large language model (LLM) responses, but evaluating these responses is computationally intensive and often fails to accommodate diverse response styles. This study introduces a novel online evaluation framework that employs a multi-agent conversational bandit model to select optimal responses while dynamically aligning with user preferences. To tackle challenges such as high-dimensional features, large response sets, adaptive conversational needs, and multi-device access, we propose MACO, Multi-Agent Conversational Online Learning, which comprises two key components: (1) \texttt{MACO-A}: executed by local agents, it employs an online elimination mechanism to filter out low-quality responses; (2) \texttt{MACO-S}: executed by the cloud server, it adaptively adjusts selection strategies based on aggregated preference data. An adaptive preference mechanism triggers asynchronous conversations to improve alignment efficiency. Theoretical analysis shows that MACO achieves near-optimal regret bounds, matching state-of-the-art performance in various degenerate cases. Extensive experiments using Google and OpenAI text embedding models on real-world datasets with diverse response styles, combined with Llama and GPT-4o, show that MACO consistently outperforms baseline methods by at least 8.29\% across varying response set sizes and numbers of agents. A brief illustrative sketch of the agent/server split appears after the abstract.
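The following is a minimal illustrative sketch, not the paper's implementation: it assumes a LinUCB-style linear bandit in which each local agent maintains its own sufficient statistics, prunes response arms whose optimistic score falls below the best pessimistic score (a stand-in for the \texttt{MACO-A} elimination step), and a cloud server pools and redistributes those statistics (a stand-in for \texttt{MACO-S} aggregation). The class names \texttt{LocalAgent} and \texttt{CloudServer} and the parameter \texttt{alpha} are hypothetical.

\begin{verbatim}
# Hypothetical sketch of local elimination + server aggregation;
# not the MACO algorithm itself, only the general structure it describes.
import numpy as np

class LocalAgent:
    def __init__(self, dim, arm_features, alpha=1.0):
        self.A = np.eye(dim)               # Gram matrix of observed features
        self.b = np.zeros(dim)             # reward-weighted feature sum
        self.arms = dict(enumerate(arm_features))  # active response arms
        self.alpha = alpha                 # exploration width (assumed)

    def select(self):
        # Pick the active arm with the largest upper confidence bound.
        theta = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        ucb = {k: x @ theta + self.alpha * np.sqrt(x @ A_inv @ x)
               for k, x in self.arms.items()}
        return max(ucb, key=ucb.get)

    def update(self, arm, reward):
        # Incorporate one observed user preference signal.
        x = self.arms[arm]
        self.A += np.outer(x, x)
        self.b += reward * x

    def eliminate(self):
        # Drop arms whose optimistic score is below the best pessimistic score.
        theta = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        mean = {k: x @ theta for k, x in self.arms.items()}
        width = {k: self.alpha * np.sqrt(x @ A_inv @ x)
                 for k, x in self.arms.items()}
        best_lcb = max(mean[k] - width[k] for k in self.arms)
        self.arms = {k: x for k, x in self.arms.items()
                     if mean[k] + width[k] >= best_lcb}

class CloudServer:
    def aggregate(self, agents):
        # Pool sufficient statistics across agents and broadcast them back,
        # so each agent's next elimination uses all collected preference data.
        dim = agents[0].A.shape[0]
        A = sum(a.A for a in agents) - (len(agents) - 1) * np.eye(dim)
        b = sum(a.b for a in agents)
        for a in agents:
            a.A, a.b = A.copy(), b.copy()
\end{verbatim}

In this sketch, elimination shrinks each agent's active response set over time, which is what reduces evaluation cost, while the server's aggregation step lets statistics gathered on one device sharpen the confidence intervals used on every other device.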