Probe-then-Commit Multi-Objective Bandits: Theoretical Benefits of Limited Multi-Arm Feedback

We study an online resource-selection problem motivated by multi-radio access selection and mobile edge computing offloading. In each round, an agent chooses among $K$ candidate links/servers (arms), each of whose performance is a stochastic $d$-dimensional vector (e.g., throughput, latency, energy, reliability). The key interaction is \emph{probe-then-commit (PtC)}: the agent may probe up to $q>1$ candidates via control-plane measurements to observe their vector outcomes, but must execute exactly one candidate in the data plane. This limited multi-arm feedback regime interpolates between classical bandits ($q=1$) and full-information experts ($q=K$), yet existing multi-objective learning theory largely focuses on these two extremes. We develop \textsc{PtC-P-UCB}, an optimistic probe-then-commit algorithm whose technical core is frontier-aware probing under uncertainty. In its Pareto mode, it selects the $q$ probes by approximately maximizing a hypervolume-inspired frontier-coverage potential and commits by marginal hypervolume gain, so as to directly expand the attained Pareto region. We prove a dominated-hypervolume frontier error of $\tilde{O}(K_P d/\sqrt{qT})$, where $K_P$ is the Pareto-frontier size and $T$ is the horizon, and a scalarized regret of $\tilde{O}(L_\phi d\sqrt{(K/q)T})$, where $\phi$ is the scalarizer. Both bounds quantify a transparent $1/\sqrt{q}$ acceleration from limited probing. We further extend the analysis to \emph{multi-modal probing}: each probe returns $M$ modalities (e.g., CSI, queue, compute telemetry), and uncertainty fusion yields variance-adaptive versions of the above bounds via an effective noise scale.
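To make the probe-then-commit step concrete, the following is a minimal sketch of hypervolume-guided probing and committing. It is not the paper's \textsc{PtC-P-UCB} procedure: it assumes $d=2$, objectives to be maximized, a reference point at the origin, and optimistic (UCB) vectors supplied by the caller. The function names `hypervolume_2d`, `greedy_probe_set`, and `commit` are hypothetical, and plain greedy selection stands in for the approximate maximization of the frontier-coverage potential.

```python
def hypervolume_2d(points, ref=(0.0, 0.0)):
    """Area dominated by `points` relative to `ref` (maximization, d = 2 only)."""
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    pts.sort(key=lambda p: p[0], reverse=True)  # sweep from the largest x
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:  # only points that raise the y-envelope add area
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def greedy_probe_set(ucb, q, ref=(0.0, 0.0)):
    """Pick up to q arms greedily by marginal hypervolume gain of their
    optimistic vectors (illustrative stand-in for the probing rule)."""
    chosen = []
    for _ in range(min(q, len(ucb))):
        base = hypervolume_2d([ucb[i] for i in chosen], ref)
        gains = {
            i: hypervolume_2d([ucb[j] for j in chosen + [i]], ref) - base
            for i in range(len(ucb))
            if i not in chosen
        }
        chosen.append(max(gains, key=gains.get))
    return chosen


def commit(probed, attained, ref=(0.0, 0.0)):
    """Among probed arms (dict: arm -> observed vector), commit to the one
    whose outcome adds the most hypervolume to the attained set."""
    base = hypervolume_2d(attained, ref)
    gains = {a: hypervolume_2d(attained + [y], ref) - base for a, y in probed.items()}
    return max(gains, key=gains.get)


# Hypothetical optimistic vectors for K = 4 arms and q = 2 probes.
ucb = [(0.9, 0.2), (0.1, 0.9), (0.5, 0.5), (0.4, 0.4)]
probes = greedy_probe_set(ucb, q=2)
arm = commit({i: ucb[i] for i in probes}, attained=[(0.5, 0.5)])
```

In this toy instance the dominated arm at $(0.4, 0.4)$ is never probed once $(0.5, 0.5)$ is in the probe set, which illustrates why a coverage-style potential spreads probes along the frontier instead of spending them on redundant candidates.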
