We study online meta-learning with bandit feedback, with the goal of improving performance across multiple tasks when they are similar according to some natural similarity measure. As the first to target the adversarial online-within-online partial-information setting, we design meta-algorithms that combine outer learners to simultaneously tune the initialization and other hyperparameters of an inner learner for two important cases: multi-armed bandits (MAB) and bandit linear optimization (BLO). For MAB, the meta-learners initialize and set hyperparameters of the Tsallis-entropy generalization of Exp3, with the task-averaged regret improving if the entropy of the optima-in-hindsight is small. For BLO, we learn to initialize and tune online mirror descent (OMD) with self-concordant barrier regularizers, showing that task-averaged regret varies directly with an action-space-dependent measure they induce. Our guarantees rely on proving that unregularized follow-the-leader combined with two levels of low-dimensional hyperparameter tuning is enough to learn a sequence of affine functions of non-Lipschitz and sometimes non-convex Bregman divergences bounding the regret of OMD.
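To make the MAB inner learner concrete, the following is a minimal sketch of Exp3 written as online mirror descent on the probability simplex. It assumes the Shannon-entropy special case of the Tsallis family, which recovers standard Exp3, and is not the paper's algorithm. The function name `exp3_omd` and the parameters `init` and `eta` are hypothetical, chosen only to show where a tunable initialization and step size would enter. The meta-level tuning described in the abstract is not implemented here.

```python
import numpy as np


def exp3_omd(losses, init, eta, rng):
    """Exp3-style entropic OMD on a (T, K) loss matrix with entries in [0, 1].

    init : starting distribution over the K arms (strictly positive, sums to 1);
           illustrative stand-in for the initialization a meta-learner might tune.
    eta  : step size; likewise a hyperparameter a meta-learner might tune.
    rng  : a numpy.random.Generator used to sample arms.
    Returns the list of arms played and the final distribution.
    """
    x = np.asarray(init, dtype=float)
    arms = []
    for loss_t in losses:
        # Sample an arm from the current distribution (bandit feedback).
        arm = int(rng.choice(len(x), p=x))
        arms.append(arm)
        # Importance-weighted loss estimate: only the played arm is observed.
        est = np.zeros_like(x)
        est[arm] = loss_t[arm] / x[arm]
        # Mirror-descent step with the negative Shannon entropy regularizer,
        # which reduces to a multiplicative-weights update plus normalization.
        x = x * np.exp(-eta * est)
        x = x / x.sum()
    return arms, x
```

In this sketch, starting `init` close to an arm that tends to be optimal across tasks would shorten the time spent exploring. That is the intuition behind tying task-averaged regret to the entropy of the optima-in-hindsight, although the paper's actual guarantees concern the Tsallis-entropy generalization.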