The Prophet Inequality and Pandora's Box problems are fundamental stochastic problems with applications in Mechanism Design, Online Algorithms, Stochastic Optimization, Optimal Stopping, and Operations Research. A standard assumption in prior work is that the probability distributions of the $n$ underlying random variables are given as input to the algorithm. Since in practice these distributions must be learned, we initiate the study of these stochastic problems in the Multi-Armed Bandits model. In this model we interact with $n$ unknown distributions over $T$ rounds: in round $t$ we play a policy $x^{(t)}$ and receive partial (bandit) feedback on the performance of $x^{(t)}$. The goal is to minimize the regret, defined as the difference, over $T$ rounds, between the total value of the optimal algorithm that knows the distributions and the total value of our algorithm, which learns the distributions from the partial feedback. Our main results give near-optimal $\tilde{O}(\mathsf{poly}(n)\sqrt{T})$ total regret algorithms for both Prophet Inequality and Pandora's Box. Our proofs proceed by maintaining confidence intervals on the unknown indices of the optimal policy. The exploration-exploitation tradeoff prevents us from directly refining these confidence intervals, so the main technique is to design a regret upper bound that is learnable while playing low-regret Bandit policies.
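The regret described above can be written out as a short formula. This is only a sketch under assumed notation that the abstract does not itself introduce: $\mathsf{Val}(x)$ stands for the expected value of playing policy $x$ against the true (unknown) distributions, and $x^{*}$ for an optimal policy that knows those distributions.

$$
\mathsf{Regret}(T) \;=\; T \cdot \mathsf{Val}(x^{*}) \;-\; \sum_{t=1}^{T} \mathbb{E}\!\left[\mathsf{Val}\big(x^{(t)}\big)\right]
$$

Under this reading, the main results state that $\mathsf{Regret}(T) = \tilde{O}(\mathsf{poly}(n)\sqrt{T})$, so the average per-round regret $\mathsf{Regret}(T)/T$ vanishes at rate $\tilde{O}(\mathsf{poly}(n)/\sqrt{T})$ as $T$ grows.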