We study social learning dynamics where the agents collectively follow a simple multi-armed bandit protocol. Agents arrive sequentially, choose arms, and receive the associated rewards. Each agent observes the full history (arms and rewards) of the previous agents, and there are no private signals. While collectively the agents face an exploration-exploitation tradeoff, each agent acts myopically, without regard to exploration. Motivating scenarios concern reviews and ratings on online platforms. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals, including the "unbiased" behavior as well as various behavioral biases. While extreme versions of these behaviors correspond to well-known bandit algorithms, we prove that more moderate versions lead to stark exploration failures, and consequently to regret rates that are linear in the number of agents. We provide matching upper bounds on regret by analyzing "moderately optimistic" agents. As a special case of independent interest, we obtain a general result on the failure of the greedy algorithm in multi-armed bandits. To the best of our knowledge, this is the first such result in the literature.
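The protocol described above can be illustrated with a minimal simulation sketch. It is not the paper's formal model: the Bernoulli rewards, the single initial sample per arm, the function name `simulate`, and the index form `mean + eta / sqrt(n)` are all assumptions made for illustration. Here `eta = 0` plays the role of the "unbiased" (greedy) agent, `eta > 0` an optimistic one, and `eta < 0` a pessimistic one.

```python
import math
import random


def simulate(means, n_agents, eta, seed=0):
    """Simulate sequential myopic agents on Bernoulli arms (illustrative sketch).

    means    : true mean reward of each arm (assumed Bernoulli rewards)
    n_agents : total number of agents, must be at least len(means)
    eta      : hypothetical bias parameter; 0 is greedy, > 0 optimistic, < 0 pessimistic
    Returns (pull counts per arm, cumulative expected regret).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of times each arm was chosen
    sums = [0.0] * k      # total reward observed for each arm
    best = max(means)
    regret = 0.0

    def pull(arm):
        # The agent receives a reward, which joins the public history.
        nonlocal regret
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]

    # Assumption: the first k agents each try a distinct arm once.
    for arm in range(k):
        pull(arm)

    # Every later agent sees the full history and acts myopically.
    for _ in range(n_agents - k):
        def index(a):
            # Point inside an assumed confidence interval around the sample mean.
            return sums[a] / counts[a] + eta / math.sqrt(counts[a])

        pull(max(range(k), key=index))

    return counts, regret
```

For example, `simulate([0.5, 0.6], 1000, eta=0.0)` runs 1000 greedy agents on two arms. Repeating the call over many seeds and comparing how often the better arm is abandoned after an unlucky first sample is one way to observe the kind of exploration failure the abstract refers to.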