We study offline and online contextual optimization with feedback information, where instead of observing the loss, we observe, after the fact, the optimal action that an oracle with full knowledge of the objective function would have taken. We aim to minimize regret, defined as the difference between our losses and those incurred by an all-knowing oracle. In the offline setting, the decision-maker has information available from past periods and needs to make one decision, while in the online setting, the decision-maker optimizes decisions dynamically over time based on a new set of feasible actions and contextual functions in each period. For the offline setting, we characterize the optimal minimax policy, establishing the performance that can be achieved as a function of the underlying geometry of the information induced by the data. In the online setting, we leverage this geometric characterization to optimize the cumulative regret. We develop an algorithm that yields the first regret bound for this problem that is logarithmic in the time horizon. Finally, we show via simulation that our proposed algorithms outperform previous methods from the literature.
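The regret notion described above can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the paper: in period $t$, $\mathcal{A}_t$ denotes the feasible action set, $\ell_t$ the unobserved loss function, $a_t$ the decision-maker's action, and $a_t^\star$ the oracle's action, which is the only feedback revealed after the fact.

```latex
% Illustrative notation only (assumed, not the paper's own symbols).
\begin{align*}
  a_t^\star &\in \arg\min_{a \in \mathcal{A}_t} \ell_t(a)
    && \text{(oracle action, observed after period } t\text{)} \\
  R_T &= \sum_{t=1}^{T} \bigl( \ell_t(a_t) - \ell_t(a_t^\star) \bigr)
    && \text{(cumulative regret over horizon } T\text{)}
\end{align*}
```

Under this notation, each summand is nonnegative because $a_t^\star$ minimizes $\ell_t$ over $\mathcal{A}_t$, and a bound that is logarithmic in the time horizon would take the form $R_T = O(\log T)$, with constants that depend on problem parameters not specified in this abstract.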