We study online meta-learning with bandit feedback, with the goal of improving performance across multiple tasks if they are similar according to some natural similarity measure. As the first to target the adversarial online-within-online partial-information setting, we design meta-algorithms that combine outer learners to simultaneously tune the initialization and other hyperparameters of an inner learner for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-learners initialize and set hyperparameters of the Tsallis-entropy generalization of Exp3, with the task-averaged regret improving if the entropy of the optima-in-hindsight is small. For BLO, we learn to initialize and tune online mirror descent (OMD) with self-concordant barrier regularizers, showing that task-averaged regret varies directly with an action space-dependent measure they induce. Our guarantees rely on proving that unregularized follow-the-leader combined with two levels of low-dimensional hyperparameter tuning is enough to learn a sequence of affine functions of non-Lipschitz and sometimes non-convex Bregman divergences bounding the regret of OMD.
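To make the MAB case concrete, here is a minimal sketch of meta-learning an initialization for Exp3 across a sequence of tasks. It is an illustration under several assumptions, not the algorithm analyzed in the paper: it uses only the Shannon-entropy limit of the Tsallis family (plain Exp3), keeps the step size `eta` fixed rather than tuning it, estimates each task's optimum-in-hindsight from importance-weighted loss estimates, and sets the next initialization to the empirical distribution of those estimated optima mixed with the uniform distribution by a hypothetical parameter `eps`. All function and parameter names are made up for this example.

```python
import math
import random


def exp3_task(losses, init, eta, rng):
    """Run Exp3 on one task, starting from the distribution `init`.

    losses: list of per-round loss vectors with entries in [0, 1]
            (assumed input format; only the pulled arm's loss is used).
    Returns (total loss incurred, estimated cumulative loss per arm).
    """
    k = len(init)
    x = list(init)
    est_cum = [0.0] * k
    total = 0.0
    for loss in losses:
        arm = rng.choices(range(k), weights=x)[0]
        total += loss[arm]
        # Importance-weighted estimate of the pulled arm's loss.
        est = loss[arm] / x[arm]
        est_cum[arm] += est
        # Exponential-weights update on the pulled arm, then renormalize.
        x[arm] *= math.exp(-eta * est)
        z = sum(x)
        x = [xi / z for xi in x]
    return total, est_cum


def meta_init(best_arm_counts, eps):
    """Average the estimated optima seen so far and mix with uniform.

    Mixing keeps every entry strictly positive (assumed safeguard).
    """
    k = len(best_arm_counts)
    n = sum(best_arm_counts)
    if n == 0:
        return [1.0 / k] * k
    return [(1 - eps) * c / n + eps / k for c in best_arm_counts]


def run_meta(tasks, k, eta=0.1, eps=0.1, seed=0):
    """Play the tasks in sequence, updating the initialization after each."""
    rng = random.Random(seed)
    counts = [0] * k
    task_losses = []
    for losses in tasks:
        init = meta_init(counts, eps)
        total, est_cum = exp3_task(losses, init, eta, rng)
        task_losses.append(total)
        # Estimated optimum-in-hindsight: arm with the smallest estimated loss.
        best = min(range(k), key=est_cum.__getitem__)
        counts[best] += 1
    return task_losses, meta_init(counts, eps)


if __name__ == "__main__":
    # Toy tasks in which arm 0 is always optimal, so the optima have low entropy.
    tasks = [[[0.0, 1.0, 1.0] for _ in range(50)] for _ in range(20)]
    per_task, final_init = run_meta(tasks, k=3)
    print(per_task)
    print(final_init)
```

When the estimated optima concentrate on a few arms, as in the toy tasks above, the learned initialization places most of its mass on those arms. This is the low-entropy regime in which the abstract says task-averaged regret improves.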