Maximizing long-term reward is the primary goal in sequential decision-making problems. Most existing methods assume that side information is freely available, so that the learning agent can observe the states of all features before making a decision. In real-world problems, however, collecting useful information is often costly. Consequently, besides learning the reward of each arm, the agent must also learn which features' states to observe in order to improve its decision-making strategy. The problem is aggravated in a non-stationary environment, where the reward and cost distributions undergo abrupt changes over time. To address this dual learning problem, we extend the contextual bandit setting and allow the agent to observe subsets of the features' states. The objective is to maximize the long-term average gain, i.e., the per-round average of the accumulated rewards minus the paid observation costs. The agent therefore faces a trade-off between minimizing the cost of information acquisition and improving its decisions by using the acquired information. To this end, we develop an algorithm that guarantees a regret that is sublinear in time. Numerical results demonstrate the superiority of the proposed policy in a real-world scenario.
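As an illustration only, the objective can be written down as follows. The symbols are assumptions introduced here and are not fixed by the abstract: $r_t$ is the reward of the arm chosen in round $t$, $S_t$ is the subset of features whose states the agent pays to observe in round $t$, $c_t(S_t)$ is the total cost of those observations, and $g_t^{\star}$ is the per-round gain of a benchmark policy (the exact benchmark is not stated here).

```latex
% Average gain over a horizon of T rounds (assumed notation)
\bar{G}_T \;=\; \frac{1}{T} \sum_{t=1}^{T} \bigl( r_t - c_t(S_t) \bigr)

% Regret against the benchmark, and what "sublinear in time" means
R_T \;=\; \sum_{t=1}^{T} \Bigl( g_t^{\star} - \bigl( r_t - c_t(S_t) \bigr) \Bigr),
\qquad
R_T = o(T) \;\Longleftrightarrow\; \lim_{T \to \infty} \frac{R_T}{T} = 0
```

Under this reading, a sublinear regret means that the average per-round gap to the benchmark vanishes as $T$ grows. Observing more features can raise $r_t$ but also raises $c_t(S_t)$, which is the trade-off described above.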