Machine learning algorithms are often applied repeatedly to problems with similar structure. We focus on solving a sequence of bandit optimization tasks and develop LIBO, an algorithm that adapts to the environment by learning from past experience and becomes more sample-efficient in the process. We assume a kernelized structure in which the kernel is unknown but shared across all tasks. LIBO sequentially meta-learns a kernel that approximates the true kernel and solves each incoming task with the latest kernel estimate. Our algorithm can be paired with any kernelized or linear bandit algorithm and guarantees oracle-optimal performance: as more tasks are solved, the regret of LIBO on each task converges to the regret of the bandit algorithm with oracle knowledge of the true kernel. Consequently, if paired with a bandit algorithm that has sublinear regret, LIBO attains sublinear lifelong regret. We also show that direct access to the data from each task is not necessary for attaining sublinear regret: we propose F-LIBO, which solves the lifelong problem in a federated manner.
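The lifelong loop described above (solve each incoming task with the latest kernel estimate, then refine the estimate from the collected data) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the shared kernel is modeled as a linear kernel over an unknown subset of feature columns, the base bandit is a simple LinUCB-style rule, and the kernel estimator is a stand-in that accumulates squared ridge coefficients across tasks and thresholds them. The names `run_task`, `update_kernel_estimate`, and `lifelong_loop` are hypothetical.

```python
import numpy as np

def run_task(features, theta, active, horizon, rng, noise=0.1, beta=1.0, lam=1.0):
    """LinUCB-style base bandit restricted to the `active` feature columns."""
    phi = features[:, active]              # features under the current kernel estimate
    V = lam * np.eye(phi.shape[1])         # regularized design matrix
    b = np.zeros(phi.shape[1])
    f = features @ theta                   # noise-free rewards of this task
    regret, X, y = 0.0, [], []
    for _ in range(horizon):
        V_inv = np.linalg.inv(V)
        mean = phi @ (V_inv @ b)
        # Per-action quadratic form phi_i^T V^{-1} phi_i, clipped for numerical safety.
        var = np.maximum(np.einsum("ij,jk,ik->i", phi, V_inv, phi), 0.0)
        a = int(np.argmax(mean + beta * np.sqrt(var)))
        r = f[a] + noise * rng.standard_normal()
        V += np.outer(phi[a], phi[a])
        b += r * phi[a]
        regret += f.max() - f[a]
        X.append(features[a])              # store the full feature vector for meta-learning
        y.append(r)
    return regret, np.array(X), np.array(y)

def update_kernel_estimate(scores, X, y, lam=1.0):
    """Stand-in meta-learner (an assumption): accumulate squared ridge coefficients."""
    d = X.shape[1]
    theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return scores + theta_hat ** 2

def lifelong_loop(num_tasks=20, horizon=50, num_actions=30, dim=10, seed=0):
    rng = np.random.default_rng(seed)
    features = rng.standard_normal((num_actions, dim))
    true_active = np.array([0, 1, 2])      # unknown structure shared by all tasks
    scores = np.zeros(dim)
    active = np.arange(dim)                # no prior knowledge: start with all features
    regrets = []
    for _ in range(num_tasks):
        theta = np.zeros(dim)
        theta[true_active] = rng.standard_normal(true_active.size)
        regret, X, y = run_task(features, theta, active, horizon, rng)
        regrets.append(regret)
        scores = update_kernel_estimate(scores, X, y)
        # Keep features whose accumulated score is within a factor of the largest one.
        active = np.flatnonzero(scores >= 0.1 * scores.max())
    return regrets, active

if __name__ == "__main__":
    regrets, active = lifelong_loop()
    print("per-task regret:", np.round(regrets, 2))
    print("estimated active features:", active)
```

The sketch only mirrors the interaction protocol: one base bandit run per task, followed by one kernel update. It carries none of the guarantees stated above, and the thresholding rule is a deliberately crude substitute for a proper meta-learning step.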