Resource Allocation in Multi-armed Bandit Exploration: Overcoming Sublinear Scaling with Adaptive Parallelism

We study exploration in stochastic multi-armed bandits when we have access to a divisible resource that can be allocated in varying amounts to arm pulls. We focus in particular on the allocation of distributed computing resources, where we may obtain results faster by allocating more resources per pull, but might have reduced throughput due to sublinear scaling. For example, in simulation-based scientific studies, an expensive simulation can be sped up by running it on multiple cores. This speed-up, however, is partly offset by the communication among cores, which results in lower throughput than if fewer cores were allocated per trial to run more trials in parallel. In this paper, we explore these trade-offs in two settings. First, in a fixed confidence setting, we need to find the best arm with a given target success probability as quickly as possible. We propose an algorithm that trades off information accumulation against throughput, and we show that the time taken can be upper bounded by the solution of a dynamic program whose inputs are the gaps between the sub-optimal and optimal arms. We also prove a matching hardness result. Second, we present an algorithm for a fixed deadline setting, where we are given a deadline and need to maximize the probability of finding the best arm. We corroborate our theoretical insights with simulation experiments, which show that the algorithms consistently match or outperform baseline algorithms on a variety of problem instances.
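The trade-off between time per pull and throughput can be illustrated with a small numerical sketch. It is not the algorithm proposed in the paper. The power-law speed-up `m ** q` with `q < 1`, the function names, and the budget of 16 cores are all assumptions chosen for illustration; the abstract only states that scaling is sublinear.

```python
def speedup(cores_per_pull: int, q: float = 0.5) -> float:
    """Hypothetical sublinear speed-up of one pull run on `cores_per_pull` cores.

    Assumption: speed-up follows m ** q with 0 < q < 1. The source does not
    specify a scaling law; this one is chosen only for illustration.
    """
    return cores_per_pull ** q


def latency_and_throughput(total_cores: int, cores_per_pull: int, q: float = 0.5):
    """Return (time per pull, pulls completed per unit time).

    A single-core pull is assumed to take one time unit.
    """
    # Number of pulls that can run side by side with this allocation.
    parallel_pulls = total_cores // cores_per_pull
    # More cores per pull shortens each pull...
    latency = 1.0 / speedup(cores_per_pull, q)
    # ...but fewer pulls run at once, so overall throughput drops.
    throughput = parallel_pulls * speedup(cores_per_pull, q)
    return latency, throughput


if __name__ == "__main__":
    for m in (1, 2, 4, 8, 16):
        lat, thr = latency_and_throughput(total_cores=16, cores_per_pull=m)
        print(f"cores/pull={m:2d}  time per pull={lat:.3f}  pulls per unit time={thr:.3f}")
```

Under these assumptions, giving all 16 cores to one pull returns each result four times faster than a single-core pull, but completes only a quarter as many pulls per unit time as running 16 single-core pulls in parallel.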
