We study stylized social learning dynamics in which self-interested agents collectively follow a simple multi-armed bandit protocol. Each agent controls an ``episode'': a short sequence of consecutive decisions. Motivating applications include users repeatedly interacting with an AI, or repeatedly shopping at a marketplace. While agents are incentivized to explore within their respective episodes, we show that aggregate exploration fails: e.g., its Bayesian regret grows linearly over time. In fact, such failure is a (very) typical case, not just a worst-case scenario. This conclusion persists even if an agent's per-episode utility is some fixed function of the per-round outcomes: e.g., $\min$ or $\max$, not just the sum. Thus, externally driven exploration is needed even when some amount of exploration happens organically.