We study a multi-objective multi-armed bandit problem in a dynamic environment. A decision-maker sequentially selects one arm from a given set, and each selected arm yields a reward vector whose elements each follow a piecewise-stationary Bernoulli distribution. The agent aims to choose arms from the Pareto optimal set so as to minimize its regret. To solve this problem, we propose a Pareto generic upper confidence bound (UCB)-based algorithm with change detection. By developing the essential inequalities for multi-dimensional spaces, we establish that our proposal guarantees a regret bound on the order of $\gamma_T\log(T/{\gamma_T})$ when the number of breakpoints $\gamma_T$ is known. Without this assumption, the regret bound of our algorithm is $\gamma_T\log(T)$. Finally, we formulate an energy-efficient waveform design problem in an integrated communication and sensing system as a toy example. Numerical experiments on the toy example and on synthetic and real-world datasets demonstrate the efficiency of our policy compared with existing methods.
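To make the notion of a Pareto optimal set of arms concrete, the following minimal Python sketch computes the non-dominated arms from vectors of mean rewards, before and after a single breakpoint. It is an illustration only and is not the proposed algorithm: the function names `dominates` and `pareto_optimal_arms`, the two objectives, and all mean values are hypothetical choices made for this example, and the true means are assumed known here, whereas a bandit policy has to estimate them.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Return True if u is at least as good as v in every objective
    and strictly better in at least one (larger reward is better)."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_optimal_arms(means: Sequence[Sequence[float]]) -> List[int]:
    """Return the indices of arms whose mean vector no other arm dominates."""
    return [
        i
        for i, u in enumerate(means)
        if not any(dominates(v, u) for j, v in enumerate(means) if j != i)
    ]


# Hypothetical Bernoulli means (two objectives, four arms) in two
# stationary segments separated by one breakpoint.
segment_1 = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.8), (0.4, 0.4)]
segment_2 = [(0.3, 0.2), (0.5, 0.5), (0.6, 0.6), (0.4, 0.4)]

front_1 = pareto_optimal_arms(segment_1)
front_2 = pareto_optimal_arms(segment_2)
print(front_1)
print(front_2)
```

In this toy instance the Pareto optimal set differs between the two segments, which is the reason a policy in a piecewise-stationary environment has to detect breakpoints rather than keep relying on statistics gathered before the change.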