In sequential decision-making problems such as the multi-armed bandit, an agent learns by optimizing a feedback signal. While the mean reward criterion has been studied extensively, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest in critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits in which such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile, which is obtained as the solution of an asymmetric least-squares problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm; we relax this requirement by allowing approximate solutions obtained by online gradient descent, at the cost of slightly higher regret. We conclude by evaluating the resulting algorithms in numerical experiments.
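To make the expectile example concrete, the following is a minimal sketch of how a linear expectile can be fitted by asymmetric least squares. It is an illustration only, not the algorithm proposed in this work: the function name `fit_linear_expectile`, the iteratively reweighted least-squares solver, and the synthetic data are all assumptions introduced here.

```python
import numpy as np

def fit_linear_expectile(X, y, tau, n_iter=100, tol=1e-10):
    """Fit theta minimizing mean(w_i * (y_i - x_i @ theta)**2), where
    w_i = tau if the residual is non-negative and 1 - tau otherwise.
    Hypothetical helper: solved by iteratively reweighted least squares."""
    theta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from ordinary least squares
    for _ in range(n_iter):
        w = np.where(y - X @ theta >= 0, tau, 1.0 - tau)
        Xw = X * w[:, None]
        # Weighted normal equations: (X^T W X) theta = X^T W y
        new_theta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta

# Synthetic data (assumed for illustration): intercept plus two context features.
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

theta_low = fit_linear_expectile(X, y, tau=0.1)
theta_mid = fit_linear_expectile(X, y, tau=0.5)
theta_high = fit_linear_expectile(X, y, tau=0.9)
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

With `tau = 0.5` the weights are symmetric and the fit reduces to ordinary least squares, which recovers the mean criterion. Values of `tau` below or above 0.5 put more weight on negative or positive residuals respectively, which is how the expectile encodes a risk attitude.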