We study social learning dynamics where the agents collectively follow a simple multi-armed bandit protocol. Agents arrive sequentially, choose arms and receive associated rewards. Each agent observes the full history (arms and rewards) of the previous agents, and there are no private signals. While collectively the agents face an exploration-exploitation tradeoff, each agent acts myopically, without regard to exploration. Motivating scenarios concern reviews and ratings on online platforms. We allow a wide range of myopic behaviors that are consistent with (parameterized) confidence intervals, including the "unbiased" behavior as well as various behavioral biases. While extreme versions of these behaviors correspond to well-known bandit algorithms, we prove that more moderate versions lead to stark exploration failures, and consequently to regret rates that are linear in the number of agents. We provide matching upper bounds on regret by analyzing "moderately optimistic" agents. As a special case of independent interest, we obtain a general result on the failure of the greedy algorithm in multi-armed bandits. This is the first such result in the literature, to the best of our knowledge.
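The protocol described above can be illustrated with a minimal simulation sketch. Everything specific in it is an assumption made for illustration, not the paper's formal model: rewards are Bernoulli, each arm gets one initial sample, ties go to the lowest-numbered arm, and each agent picks the arm maximizing a hypothetical index `mean + eta / sqrt(n)`, where `eta = 0` stands in for the "unbiased" (greedy) behavior and larger `eta` for more optimistic agents. The function name `simulate` and its parameters are likewise hypothetical.

```python
import math
import random


def simulate(means, n_agents, eta, seed=0):
    """Simulate myopic agents on a Bernoulli bandit.

    Returns (pull counts per arm, expected regret).
    The index mean + eta / sqrt(n) is an illustrative assumption.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of times each arm was chosen
    sums = [0.0] * k      # total reward observed for each arm
    best = max(means)
    regret = 0.0
    for t in range(n_agents):
        if t < k:
            arm = t  # assumed initial data: one sample per arm
        else:
            # Each agent sees the full history and acts myopically
            # on a confidence-interval-style index.
            index = [
                sums[a] / counts[a] + eta / math.sqrt(counts[a])
                for a in range(k)
            ]
            # max() returns the first maximizer, so ties favor arm 0.
            arm = max(range(k), key=index.__getitem__)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return counts, regret


if __name__ == "__main__":
    for eta in (0.0, 2.0):
        counts, regret = simulate([0.5, 0.6], 1000, eta, seed=1)
        print(f"eta={eta}: counts={counts}, regret={regret:.1f}")
```

In this sketch, with `eta = 0` and means `[0.5, 0.6]`, any run in which the better arm's first reward is 0 never returns to that arm, so regret grows by 0.1 per agent, which is the kind of linear-regret exploration failure the abstract refers to. With a sufficiently large `eta`, the shrinking bonus on the frequently chosen arm eventually forces agents back to the neglected one.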