Existing risk-aware multi-armed bandit models typically focus on risk measures of individual options, such as variance. As a result, they cannot be directly applied to important real-world online decision-making problems with correlated options. In this paper, we propose a novel Continuous Mean-Covariance Bandit (CMCB) model that explicitly takes option correlation into account. Specifically, in CMCB, a learner sequentially chooses weight vectors over the given options and observes random feedback according to its decisions. The learner's objective is to achieve the best trade-off between reward and risk, where risk is measured by the option covariance. To capture the different reward observation scenarios that arise in practice, we consider three feedback settings: full-information, semi-bandit, and full-bandit feedback. We propose novel algorithms whose regret is optimal up to logarithmic factors, and we provide matching lower bounds to validate their optimality. Experimental results also demonstrate the superiority of our algorithms. To the best of our knowledge, this is the first work that considers option correlation in risk-aware bandits and explicitly quantifies how arbitrary covariance structures affect learning performance. The novel analytical techniques we develop, for exploiting the estimated covariance to build concentration bounds and for bounding the risk of selected actions based on properties of the sampling strategy, are likely to find applications in other bandit analyses and may be of independent interest.
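To make the setting above concrete, the following is a minimal sketch of a mean-covariance objective and the three feedback types, not the paper's algorithms. Everything specific in it is an assumption made for illustration: the objective form w^T theta - rho * w^T Sigma w, the risk-aversion parameter `rho`, Gaussian rewards, the restriction of weight vectors to the probability simplex, the usual bandit-literature reading of the three feedback types, and the names `mean_covariance_objective` and `observe`.

```python
import numpy as np

# Illustrative two-option instance (all values are assumptions).
theta = np.array([0.2, 0.3])                 # expected rewards of the options
sigma = np.array([[1.0, -0.5],
                  [-0.5, 1.0]])              # option covariance (negatively correlated)
rho = 1.0                                    # hypothetical risk-aversion weight


def mean_covariance_objective(w, theta, sigma, rho):
    """Expected reward minus rho times portfolio risk: w^T theta - rho * w^T Sigma w."""
    w = np.asarray(w, dtype=float)
    return float(w @ theta - rho * (w @ sigma @ w))


def observe(w, theta, sigma, rng, feedback):
    """Draw one random reward vector and return what the learner gets to see."""
    w = np.asarray(w, dtype=float)
    x = rng.multivariate_normal(theta, sigma)    # assumed Gaussian rewards
    if feedback == "full-information":
        return x                                 # rewards of all options
    if feedback == "semi-bandit":
        return np.where(w > 0, x, np.nan)        # only options with positive weight
    if feedback == "full-bandit":
        return float(w @ x)                      # only the weighted sum
    raise ValueError(f"unknown feedback type: {feedback}")


# Compare putting all weight on one option with an even split.
single_a = mean_covariance_objective([1.0, 0.0], theta, sigma, rho)
single_b = mean_covariance_objective([0.0, 1.0], theta, sigma, rho)
split = mean_covariance_objective([0.5, 0.5], theta, sigma, rho)
```

In this toy instance the even split scores approximately 0.0, while the two single-option choices score approximately -0.8 and -0.7. The negative correlation lowers the risk of the mixed weight vector, which a per-option risk measure such as variance would not capture.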