Motivated by concerns about online decisions that incur an undue amount of risk at each time step, we formulate the probably anytime-safe stochastic combinatorial semi-bandits problem. In this problem, the agent may select a subset of size at most $K$ from a set of $L$ ground items. Each item is associated with a mean reward and with a variance that represents its risk. To mitigate the risk that the agent incurs, we require that, with probability at least $1-\delta$, over the entire time horizon $T$, every subset the agent chooses contains items whose variances sum to at most a given variance budget. We call this the probably anytime-safe constraint. Under this constraint, we design and analyze an algorithm, {\sc PASCombUCB}, that minimizes the regret over the time horizon $T$. By developing accompanying information-theoretic lower bounds, we show that, under both the problem-dependent and problem-independent paradigms, {\sc PASCombUCB} is almost asymptotically optimal. Our problem setup, the proposed {\sc PASCombUCB} algorithm, and the novel analyses are applicable to domains such as recommendation systems and transportation, in which an agent may choose multiple items at a single time step and wishes to control the risk over the whole time horizon.
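As a sketch of how the probably anytime-safe constraint might be written formally, one can introduce the following notation, none of which appears in the abstract and all of which is assumed here for illustration: $S_t \subseteq \{1,\dots,L\}$ with $|S_t| \le K$ is the subset chosen at time step $t$, $\sigma_i^2$ is the variance of item $i$, and $\bar{\sigma}^2$ is the variance budget. Under these assumptions the constraint reads
\[
  \Pr\Bigl[\, \forall\, t \in \{1,\dots,T\} :\ \sum_{i \in S_t} \sigma_i^2 \le \bar{\sigma}^2 \,\Bigr] \;\ge\; 1-\delta .
\]
Note that the probability is taken over the joint event across all $T$ time steps, not over each time step separately, which is what makes the guarantee hold "anytime".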