Many resource management problems require sequential decision-making under uncertainty, where the only uncertainty affecting decision outcomes comes from exogenous variables outside the control of the decision-maker. We model these problems as Exo-MDPs (Markov Decision Processes with Exogenous Inputs) and design a class of data-efficient algorithms for them, termed Hindsight Learning (HL). Our HL algorithms achieve data efficiency by leveraging a key insight: given samples of the exogenous variables, past decisions can be revisited in hindsight to infer counterfactual consequences, which can accelerate policy improvement. We compare HL against classic baselines on the multi-secretary and airline revenue management problems. We also scale our algorithms to a business-critical cloud resource management problem, the allocation of Virtual Machines (VMs) to physical machines, and evaluate them in simulation on real datasets from a large public cloud provider. We find that HL algorithms outperform both domain-specific heuristics and state-of-the-art reinforcement learning methods.
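To make the hindsight idea concrete, the following is a minimal sketch for a multi-secretary-style problem, not the HL algorithm itself. It assumes a simplified setting that the abstract does not spell out: candidates arrive one at a time with a numeric score (the exogenous variable), the decision-maker may hire at most `budget` of them, and the reward is the sum of the hired scores. The function names `hindsight_value` and `hindsight_q` are hypothetical. Given one fully observed trace of scores, the best achievable total from any step is simply the sum of the largest remaining scores, so the counterfactual value of each action at a past step can be computed exactly.

```python
import heapq


def hindsight_value(values, t, budget):
    """Best total from step t onward with `budget` hires left,
    when the whole trace of scores is known in hindsight."""
    if budget <= 0:
        return 0.0
    # With full knowledge of the trace, hiring the top `budget`
    # remaining candidates is optimal in this simplified setting.
    return float(sum(heapq.nlargest(budget, values[t:])))


def hindsight_q(values, t, budget):
    """Counterfactual value of each action at step t for one trace."""
    # Rejecting keeps the full budget for later candidates.
    reject = hindsight_value(values, t + 1, budget)
    if budget > 0:
        # Accepting collects the current score and uses one hire.
        accept = values[t] + hindsight_value(values, t + 1, budget - 1)
    else:
        accept = float("-inf")  # cannot hire without remaining budget
    return {"accept": accept, "reject": reject}


# Hypothetical trace of candidate scores (illustrative data only).
trace = [3, 9, 1, 7]
q = hindsight_q(trace, t=0, budget=2)
print(q)  # {'accept': 12.0, 'reject': 16.0}
```

On this trace, rejecting the first candidate is better in hindsight, because the two best scores arrive later. Values of this kind, computed over many sampled traces, are the sort of counterfactual signal that the abstract describes as accelerating policy improvement; how HL actually uses them is not specified here.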