Allocation tasks are a class of problems in which a limited amount of resources must be distributed among a set of entities at each time step. Prominent examples include portfolio optimization and the distribution of computational workloads across servers. Allocation tasks are typically subject to linear constraints that describe practical requirements and must be strictly satisfied at all times. In portfolio optimization, for example, investors may be required to allocate less than 30\% of their funds to a given industrial sector in every investment period. Such constraints restrict the action space of permitted allocations in intricate ways, which makes it difficult to learn a policy that avoids constraint violations. In this paper, we propose a new method for constrained allocation tasks that uses an autoregressive process to sample the allocation for each entity sequentially. In addition, we introduce a novel de-biasing mechanism to counter the initial bias caused by sequential sampling. We demonstrate that our approach outperforms a variety of Constrained Reinforcement Learning (CRL) methods on three distinct constrained allocation tasks: portfolio optimization, computational workload distribution, and a synthetic allocation benchmark. Our code is available at: https://github.com/niklasdbs/paspo
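To make the idea of sequential sampling concrete, the following is a minimal sketch and not the method from the paper. It assumes the simplest possible constraint set, a hypothetical per-entity cap `upper[i]` on top of the requirement that all shares sum to 1, and it draws each share uniformly instead of from a learned policy. The function name `sample_allocation` and its parameters are illustrative only. At each step, the feasible interval for the current entity is derived from what has already been allocated and from what the remaining entities can still absorb, so every sample satisfies the constraints by construction.

```python
import random


def sample_allocation(upper, rng=random):
    """Sequentially sample shares a[i] with 0 <= a[i] <= upper[i] and sum(a) == 1.

    Illustrative assumption: constraints are per-entity caps only.
    """
    if sum(upper) < 1.0:
        raise ValueError("upper bounds must sum to at least 1")

    allocation = []
    remaining = 1.0  # budget not yet assigned
    for i in range(len(upper)):
        # Capacity that the entities after i can still absorb.
        later_capacity = sum(upper[i + 1:])
        # Entity i must take at least what later entities cannot absorb,
        # and at most its own cap or the remaining budget.
        low = max(0.0, remaining - later_capacity)
        high = min(upper[i], remaining)
        share = rng.uniform(low, high) if high > low else low
        allocation.append(share)
        remaining -= share
    return allocation


if __name__ == "__main__":
    random.seed(0)
    print(sample_allocation([0.3, 0.5, 0.6]))
```

In this sketch, the interval available to an entity depends on the shares drawn before it, so the distribution of sampled allocations depends on the order in which entities are processed. This is presumably related to the bias from sequential sampling that the abstract mentions, although the de-biasing mechanism itself is not shown here.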