Personalized AI-based services involve a population of individual reinforcement learning agents. However, most reinforcement learning algorithms focus on harnessing individual learning and fail to leverage the social learning capabilities commonly exhibited by humans and animals. Social learning integrates individual experience with observations of others' behavior, presenting opportunities for improved learning outcomes. In this study, we focus on a social bandit learning scenario in which a social agent observes other agents' actions without knowledge of their rewards. The agents independently pursue their own policies, with no explicit motivation to teach one another. We propose a free energy-based social bandit learning algorithm over the policy space, in which the social agent evaluates others' expertise levels without resorting to any oracle or social norms. Accordingly, the social agent integrates its direct experience in the environment with its estimates of others' policies. We prove that our algorithm converges to the optimal policy. Empirical evaluations validate the superiority of our social learning method over alternative approaches across various scenarios. Our algorithm strategically identifies the relevant agents, even in the presence of random or suboptimal agents, and exploits their behavioral information. Beyond societies that include expert agents, our algorithm significantly enhances individual learning performance even when the observed agents are relevant but non-expert, a setting in which most related methods fail. Importantly, it also maintains logarithmic regret.
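The core idea of combining direct bandit experience with an observed peer policy can be illustrated with a minimal sketch. The details below are illustrative assumptions, not the paper's actual algorithm: the peer's policy is estimated from smoothed action frequencies and used as a prior, and the agent acts with the free-energy-minimizing softmax policy π ∝ prior · exp(β·q), where q holds the agent's empirical value estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5                              # number of arms
true_means = rng.uniform(0, 1, K)  # Bernoulli reward probabilities
T = 2000
beta = 5.0                         # inverse temperature (assumed hyperparameter)

# Hypothetical peer: observable actions only, rewards hidden.
# Assumed near-expert: pulls the best arm 80% of the time.
def peer_action():
    return int(np.argmax(true_means)) if rng.random() < 0.8 else int(rng.integers(K))

counts = np.zeros(K)       # own pull counts
sums = np.zeros(K)         # own cumulative rewards
peer_counts = np.ones(K)   # Laplace-smoothed peer action frequencies

total_reward = 0.0
for t in range(T):
    # Empirical value estimates; optimistic default for unpulled arms.
    q = np.where(counts > 0, sums / np.maximum(counts, 1), 0.5)
    prior = peer_counts / peer_counts.sum()   # estimated peer policy
    # Free-energy-minimizing policy: pi ∝ prior * exp(beta * q),
    # i.e. the minimizer of E_pi[-q] + (1/beta) * KL(pi || prior).
    logits = beta * q + np.log(prior)
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    a = rng.choice(K, p=pi)
    r = rng.binomial(1, true_means[a])
    counts[a] += 1
    sums[a] += r
    total_reward += r
    peer_counts[peer_action()] += 1           # observe the peer's action
```

Because the KL term pulls the policy toward the observed peer behavior while the value term rewards the agent's own experience, a reliable peer accelerates convergence, while an uninformative peer merely flattens the prior. A fuller treatment would also estimate each peer's expertise level, as the paper describes.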