Machine learning algorithms are often applied repeatedly to problems with similar structure. We focus on solving a sequence of bandit optimization tasks and develop LIBO, an algorithm that adapts to the environment by learning from past experience and becomes more sample-efficient in the process. We assume a kernelized structure in which the kernel is unknown but shared across all tasks. LIBO sequentially meta-learns a kernel that approximates the true kernel and solves each incoming task with the latest kernel estimate. Our algorithm can be paired with any kernelized or linear bandit algorithm and guarantees oracle-optimal performance, meaning that as more tasks are solved, the regret of LIBO on each task converges to the regret of the bandit algorithm with oracle knowledge of the true kernel. Naturally, if paired with a bandit algorithm that has sublinear regret, LIBO yields sublinear lifelong regret. We also show that direct access to the data from each task is not necessary for attaining sublinear regret. We propose F-LIBO, which solves the lifelong problem in a federated manner.