Motivated by concerns about online decisions that incur an undue amount of risk at each time step, we formulate the probably anytime-safe stochastic combinatorial semi-bandits problem. In this problem, the agent may select a subset of size at most $K$ from a set of $L$ ground items. Each item is associated with a mean reward as well as a variance that represents its risk. To mitigate the risk that the agent incurs, we require that, with probability at least $1-\delta$, over the entire time horizon $T$, each of the agent's choices contains items whose variances sum to at most a given variance budget. We call this the probably anytime-safe constraint. Under this constraint, we design and analyze an algorithm, {\sc PASCombUCB}, that minimizes the regret over the time horizon $T$. By developing accompanying information-theoretic lower bounds, we show that, under both the problem-dependent and problem-independent paradigms, {\sc PASCombUCB} is almost asymptotically optimal. Experiments corroborate our theoretical findings. Our problem setup, the proposed {\sc PASCombUCB} algorithm, and the novel analyses are applicable to domains such as recommendation systems and transportation, in which an agent is allowed to choose multiple items at a single time step and wishes to control the risk over the whole time horizon.
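As a sketch of the constraint in symbols, one possible way to write the probably anytime-safe requirement is given below. The notation is assumed here purely for illustration and is not fixed by the text above: $S_t \subseteq \{1,\dots,L\}$ denotes the subset chosen at time step $t$ (with $|S_t| \le K$), $\sigma_i^2$ the variance of item $i$, and $\bar{\sigma}^2$ the variance budget.

$$\mathbb{P}\Big[\, \forall\, t \in \{1,\dots,T\}:\ \sum_{i \in S_t} \sigma_i^2 \le \bar{\sigma}^2 \,\Big] \ge 1-\delta.$$

Under this reading, the quantifier over $t$ sits inside the probability, so a single failure budget $\delta$ covers all $T$ rounds jointly rather than each round separately.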