In many modern applications, a system must dynamically choose between several adaptive learning algorithms that are trained online. Examples include model selection in streaming environments, switching between trading strategies in finance, and orchestrating multiple contextual bandit or reinforcement learning agents. At each round, the learner selects one predictor among $K$ adaptive experts to make a prediction, while updating at most $M \le K$ of them under a fixed per-round training budget. We address this problem in the \emph{stochastic setting} and introduce \algname{M-LCB}, a computationally efficient UCB-style meta-algorithm that provides \emph{anytime regret guarantees}. Its confidence intervals are built directly from realized losses, require no additional optimization, and seamlessly reflect the convergence properties of the underlying experts. If each expert achieves internal regret $\tilde O(T^\alpha)$, then \algname{M-LCB} ensures overall regret bounded by $\tilde O\!\Bigl(\sqrt{\tfrac{KT}{M}} \;+\; (K/M)^{1-\alpha}\,T^\alpha\Bigr)$. To our knowledge, this is the first result establishing regret guarantees when multiple adaptive experts are trained simultaneously under per-round budget constraints. We illustrate the framework with two representative cases: (i) parametric models trained online with stochastic losses, and (ii) experts that are themselves multi-armed bandit algorithms. These examples highlight how \algname{M-LCB} extends the classical bandit paradigm to the more realistic scenario of coordinating stateful, self-learning experts under limited resources.
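To make the selection rule concrete, below is a minimal Python sketch of an \algname{M-LCB}-style loop. It is an illustration under stated assumptions, not the paper's construction: the expert interface (\texttt{predict}/\texttt{update}), the confidence width $c\sqrt{2\log t / n_i}$, and the budgeted update schedule (train the played expert, then the least-played others) are all assumptions made for this example. Since losses rather than rewards are observed, optimism here means playing the expert with the \emph{smallest lower} confidence bound on its mean loss, which is presumably what the ``LCB'' in the name reflects.
\begin{verbatim}
import math

class MLCBSketch:
    """Minimal sketch of an M-LCB-style meta-learner over K adaptive
    experts with a per-round training budget M <= K. Illustrative only:
    the confidence width and update schedule below are assumptions,
    not the paper's construction."""

    def __init__(self, experts, budget, c=1.0):
        self.experts = experts            # assumed interface: .predict(x), .update(x, y)
        self.M = budget                   # per-round training budget, M <= K
        self.c = c                        # confidence-width scale (assumed)
        self.n = [0] * len(experts)       # times each expert was played
        self.mean = [0.0] * len(experts)  # running mean of realized losses in [0, 1]
        self.t = 0                        # round counter

    def select(self):
        """Optimism with losses: play the expert whose loss could
        plausibly be lowest, i.e. the one with the smallest lower
        confidence bound on its mean realized loss."""
        self.t += 1
        for i, n_i in enumerate(self.n):
            if n_i == 0:
                return i                  # play every expert once first
        lcb = [m - self.c * math.sqrt(2.0 * math.log(self.t) / n)
               for m, n in zip(self.mean, self.n)]
        return min(range(len(self.experts)), key=lcb.__getitem__)

    def record(self, i, loss):
        """Fold the realized loss of the played expert into its running
        statistics; no auxiliary optimization is needed, matching the
        claim that intervals are built directly from realized losses."""
        self.n[i] += 1
        self.mean[i] += (loss - self.mean[i]) / self.n[i]

    def training_set(self, played):
        """One plausible budget schedule (an assumption): always train
        the played expert, then spend the remaining M - 1 slots on the
        least-played experts so stale experts keep converging."""
        rest = sorted((j for j in range(len(self.experts)) if j != played),
                      key=self.n.__getitem__)
        return [played] + rest[:self.M - 1]
\end{verbatim}
A driver loop would call \texttt{select}, query the chosen expert's \texttt{predict}, feed the realized loss to \texttt{record}, and then call \texttt{update} on each index in \texttt{training\_set}, so that exactly $M$ experts train per round.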