We study stylized social-learning dynamics in which self-interested agents collectively follow a simple multi-armed bandit protocol. Each agent controls an ``episode'': a short sequence of consecutive decisions. Motivating applications include users repeatedly interacting with an AI, or repeatedly shopping at a marketplace. While each agent is incentivized to explore within its own episode, we show that aggregate exploration fails: e.g., the Bayesian regret grows linearly over time. In fact, such failure is the (very) typical case, not just a worst-case scenario. This conclusion persists even if an agent's per-episode utility is an arbitrary fixed function of the per-round outcomes: e.g., $\min$ or $\max$, not just the sum. Thus, externally driven exploration is needed even when some amount of exploration happens organically.
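To make the failure mode concrete, below is a minimal simulation sketch, not the protocol analyzed in the paper. It assumes a two-armed Bernoulli bandit with independent Beta(1,1) priors; as a simplification, each agent skips within-episode exploration and commits to the arm with the higher posterior mean for its whole episode, while Thompson sampling serves as a stand-in for externally driven exploration. The function name \texttt{episode\_dynamics} and all parameter values are illustrative.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def episode_dynamics(policy, n_agents=500, horizon=3, n_runs=200):
    """Average regret over bandit instances drawn from the prior,
    i.e., an estimate of Bayesian regret after n_agents episodes.
    Illustrative sketch; not the paper's exact protocol."""
    total = 0.0
    for _ in range(n_runs):
        mu = rng.uniform(size=2)       # true arm means ~ Beta(1,1) prior
        s, f = np.ones(2), np.ones(2)  # shared public history: Beta counts
        for _ in range(n_agents):
            if policy == "greedy":
                # self-interested agent (simplified): commit to the arm
                # with the higher posterior mean for its whole episode
                arm = int(np.argmax(s / (s + f)))
            else:
                # Thompson sampling: stand-in for external exploration
                arm = int(np.argmax(rng.beta(s, f)))
            for _ in range(horizon):
                r = int(rng.random() < mu[arm])  # Bernoulli reward
                s[arm] += r
                f[arm] += 1 - r
            total += horizon * (mu.max() - mu[arm])
    return total / n_runs

for policy in ("greedy", "thompson"):
    print(policy, episode_dynamics(policy))
\end{verbatim}

In this sketch, the greedy dynamic's average regret grows roughly linearly with the number of agents (with constant probability it herds on the inferior arm and never recovers), whereas Thompson sampling's regret flattens out, mirroring the abstract's contrast between organic and externally driven exploration.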