We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits are widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human behavior well enough to jump-start contextual multi-armed bandits and thereby reduce online learning regret. We propose an initialization algorithm for contextual bandits that prompts an LLM to produce a pre-training dataset of approximate human preferences for the bandit, significantly reducing both online learning regret and the cost of gathering training data. We validate our approach empirically through two sets of experiments with different bandit setups: one in which an LLM serves as the oracle, and a real-world experiment using data from a conjoint survey.
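The warm-start idea above can be sketched in code. The snippet below is a minimal illustration, not the paper's actual algorithm: it uses a standard LinUCB-style bandit with per-arm ridge-regression statistics, and a placeholder function `llm_preference_data` standing in for the LLM-prompted preference dataset (here simulated with synthetic noisy linear rewards). Folding that dataset into the bandit's statistics before online learning pulls its parameter estimates toward the true preferences, which is the mechanism by which early-round regret is reduced.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms = 5, 3
# Hypothetical "true" per-arm preference vectors that the LLM approximates.
theta_true = rng.normal(size=(n_arms, d))

def llm_preference_data(n=200, noise=0.3):
    """Stand-in for prompting an LLM: returns (context, arm, reward)
    tuples approximating human preferences with some noise."""
    X = rng.normal(size=(n, d))
    arms = rng.integers(0, n_arms, size=n)
    rewards = np.einsum("ij,ij->i", X, theta_true[arms])
    rewards += rng.normal(scale=noise, size=n)
    return X, arms, rewards

class LinUCB:
    def __init__(self, d, n_arms, alpha=1.0, lam=1.0):
        self.alpha = alpha
        # Per-arm ridge Gram matrices and response vectors.
        self.A = np.stack([lam * np.eye(d) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, d))

    def warm_start(self, X, arms, rewards):
        # Fold the LLM-generated dataset into the per-arm statistics,
        # exactly as if those rounds had been played online.
        for x, a, r in zip(X, arms, rewards):
            self.A[a] += np.outer(x, x)
            self.b[a] += r * x

    def select(self, x):
        # Upper-confidence-bound arm selection.
        scores = []
        for a in range(len(self.A)):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, x, a, r):
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x

# Pre-training pulls the estimates toward theta_true; a cold-started
# bandit's estimates are all zeros.
bandit = LinUCB(d, n_arms)
X, arms, rewards = llm_preference_data()
bandit.warm_start(X, arms, rewards)
theta_hat = np.stack(
    [np.linalg.inv(bandit.A[a]) @ bandit.b[a] for a in range(n_arms)]
)
err_warm = np.linalg.norm(theta_hat - theta_true)
err_cold = np.linalg.norm(theta_true)  # distance of the zero estimate
```

In this sketch `err_warm` is far below `err_cold`, so the warm-started bandit begins online learning near the true preference parameters instead of at an uninformed prior; the online `update` then refines the same statistics with real feedback.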