We consider an online two-stage stochastic optimization problem with long-term constraints over a finite horizon of $T$ periods. At each period, we take the first-stage action, observe a realization of the model parameter, and then take the second-stage action from a feasible set that depends on both the first-stage decision and the model parameter. We aim to minimize the cumulative objective value while guaranteeing that the long-term average of the second-stage decisions belongs to a given set. We propose a general algorithmic framework that derives online algorithms for the online two-stage problem from adversarial learning algorithms. Moreover, the regret bound of our algorithm can be reduced to the regret bounds of the embedded adversarial learning algorithms. Based on this framework, we obtain new results under various settings. When the model parameter at each period is drawn from the same distribution, we derive a state-of-the-art regret bound that improves on previous bounds in special cases. Our algorithm is also robust to adversarial corruptions of the model parameter realizations. When the model parameters are drawn from unknown non-stationary distributions and we are given prior estimates of these distributions, we develop a new algorithm from our framework with regret $O(W_T+\sqrt{T})$, where $W_T$ measures the total inaccuracy of the prior estimates.
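The setting described above can be written compactly as follows. This is only a sketch under assumed notation: the symbols $x_t$ (first-stage action), $\theta_t$ (model parameter realization), $y_t$ (second-stage action), $f$ (per-period objective), $\mathcal{X}$, $\mathcal{K}$, and $\mathcal{B}$ do not appear in the text and are introduced here purely for illustration. The exact form of the objective and of the regret benchmark may differ.

```latex
% Hypothetical notation, not taken from the source:
%   x_t      first-stage action, chosen before \theta_t is observed
%   \theta_t model parameter realization at period t
%   y_t      second-stage action, chosen after \theta_t is observed
%   \mathcal{K}(x_t,\theta_t)  feasible set of the second-stage action
%   \mathcal{B}                target set for the long-term average
\begin{aligned}
\min_{\{x_t,\,y_t\}_{t=1}^{T}} \quad & \sum_{t=1}^{T} f(x_t, y_t; \theta_t) \\
\text{s.t.} \quad & x_t \in \mathcal{X}, \qquad
                    y_t \in \mathcal{K}(x_t, \theta_t), \qquad t = 1, \dots, T, \\
                  & \frac{1}{T} \sum_{t=1}^{T} y_t \in \mathcal{B}.
\end{aligned}
```

In this notation, the last line is the long-term constraint: it couples the decisions across all $T$ periods, whereas the constraint $y_t \in \mathcal{K}(x_t, \theta_t)$ must hold within each period.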