We develop a meta-learning framework for simple regret minimization in bandits. In this framework, a learning agent interacts with a sequence of bandit tasks, which are sampled i.i.d.\ from an unknown prior distribution, and learns its meta-parameters to perform better on future tasks. We propose the first Bayesian and frequentist meta-learning algorithms for this setting. The Bayesian algorithm has access to a prior distribution over the meta-parameters, and its meta simple regret over $m$ bandit tasks with horizon $n$ is only $\tilde{O}(m / \sqrt{n})$. The meta simple regret of the frequentist algorithm, by contrast, is $\tilde{O}(\sqrt{m} n + m / \sqrt{n})$. Although its regret is worse, the frequentist algorithm is more general: it does not need a prior distribution over the meta-parameters, and it can be analyzed in more settings. We instantiate our algorithms for several classes of bandit problems, and we complement our theory by evaluating them empirically in several environments.
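As a rough reading of the two rates (a sketch that ignores the constants and logarithmic factors hidden by $\tilde{O}$, and is not a statement taken from the analysis itself), dividing each bound by $m$ gives the average simple regret per task:
\[
\frac{1}{m} \cdot \frac{m}{\sqrt{n}} = \frac{1}{\sqrt{n}} \quad \text{(Bayesian)}, \qquad \frac{1}{m} \left( \sqrt{m}\, n + \frac{m}{\sqrt{n}} \right) = \frac{n}{\sqrt{m}} + \frac{1}{\sqrt{n}} \quad \text{(frequentist)}.
\]
The additional frequentist term $n / \sqrt{m}$ vanishes as the number of tasks $m$ grows, and it is no larger than $1 / \sqrt{n}$ once $m \ge n^3$, since $\sqrt{m}\, n \le m / \sqrt{n}$ is equivalent to $n^{3/2} \le \sqrt{m}$.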