Many resource management problems require sequential decision-making under uncertainty, where the only uncertainty affecting decision outcomes comes from exogenous variables outside the decision-maker's control. We model these problems as Exo-MDPs (Markov Decision Processes with Exogenous Inputs) and design a class of data-efficient algorithms for them, termed Hindsight Learning (HL). HL algorithms achieve data efficiency by leveraging a key insight: given samples of the exogenous variables, past decisions can be revisited in hindsight to infer counterfactual consequences that accelerate policy improvement. We compare HL against classic baselines on the multi-secretary and airline revenue management problems. We also scale our algorithms to a business-critical cloud resource management problem, allocating Virtual Machines (VMs) to physical machines, and simulate their performance with real datasets from a large public cloud provider. We find that HL algorithms outperform both domain-specific heuristics and state-of-the-art reinforcement learning methods.
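To make the hindsight idea concrete, the following is a minimal sketch for a simplified multi-secretary setting, not the paper's algorithm. It assumes that candidate values are the only exogenous input, that a fixed budget of hires is available, and that one full trace of values has been recorded. Under those assumptions the best achievable total from any point is the sum of the largest remaining values, so the counterfactual value of accepting or rejecting a past candidate can be computed exactly. The function names `hindsight_value` and `hindsight_q` are hypothetical.

```python
import heapq


def hindsight_value(values, t, budget):
    """Best total achievable from step t with `budget` hires left,
    given full knowledge of the recorded trace `values`."""
    if budget <= 0:
        return 0.0
    # With the future known, the optimal plan hires the top `budget` candidates.
    return float(sum(heapq.nlargest(budget, values[t:])))


def hindsight_q(values, t, budget):
    """Counterfactual value of each action on candidate t, in hindsight."""
    reject = hindsight_value(values, t + 1, budget)
    if budget > 0:
        accept = values[t] + hindsight_value(values, t + 1, budget - 1)
    else:
        accept = float("-inf")  # assumption: accepting is infeasible with no budget
    return {"reject": reject, "accept": accept}


# One recorded trace of candidate values (illustrative numbers, not real data).
trace = [2, 9, 5, 7]
q = hindsight_q(trace, t=0, budget=2)
print(q)
```

On this trace, rejecting the first candidate is better in hindsight, because two stronger candidates arrive later. A learning algorithm could average such per-trace estimates over many sampled traces and use them as targets for policy improvement; how HL does this in detail is not specified in the abstract.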