Offline learning has become widely used because it can derive effective policies from datasets gathered by expert demonstrators, without interacting with the environment directly. Recent research has explored various ways to improve offline learning efficiency by exploiting characteristics of the dataset, such as the demonstrators' expertise levels or the presence of multiple demonstrators. However, zero-sum games call for a different approach, since outcomes vary significantly with the opponent's strategy. In this study, we introduce a novel approach that uses unsupervised learning to estimate the exploited level of each trajectory in an offline dataset of zero-sum games collected from diverse demonstrators. We then incorporate the estimated exploited level into offline learning to maximize the influence of dominant strategies. Our method yields interpretable exploited-level estimates across multiple zero-sum games and effectively identifies dominant-strategy data. Moreover, our exploited-level-augmented offline learning significantly improves the underlying offline learning algorithms, including imitation learning and offline reinforcement learning, in zero-sum games.
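The abstract does not specify how the estimated exploited level enters the learning objective. A minimal sketch, assuming one plausible instantiation: each trajectory's estimated exploited level is converted into a normalized weight via a softmax over negated levels, so that trajectories from dominant (less-exploited) strategies dominate the offline learning loss. The function name and the `temperature` parameter are illustrative, not part of the paper's method.

```python
import numpy as np

def exploited_level_weights(levels, temperature=1.0):
    """Turn per-trajectory exploited-level estimates into normalized
    weights that emphasize dominant (less-exploited) strategies.

    levels: array-like of nonnegative exploited-level estimates,
            one per trajectory (lower = closer to a dominant strategy).
    temperature: controls how sharply low-exploited trajectories
            are favored (smaller = sharper).
    """
    levels = np.asarray(levels, dtype=float)
    # Lower exploited level -> larger weight (softmax over -levels).
    w = np.exp(-levels / temperature)
    return w / w.sum()

# Trajectories 0 and 1 come from near-dominant strategies,
# trajectory 2 from a heavily exploited one.
weights = exploited_level_weights([0.1, 0.2, 2.0])
```

The resulting weights could then scale per-trajectory losses in, e.g., behavior cloning or an offline RL objective, concentrating the policy on dominant-strategy data.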