We study online resource allocation under non-stationary demand with minimal offline data requirements. In this problem, a decision-maker must allocate multiple types of resources to sequentially arriving queries over a finite horizon. Each query belongs to a finite set of types with fixed resource consumption and a stochastic reward drawn from an unknown, type-specific distribution. Critically, the environment exhibits arbitrary non-stationarity -- arrival distributions may shift unpredictably -- while the algorithm requires only one historical sample per period to operate effectively. We distinguish two settings based on sample informativeness: (i) reward-observed samples containing both the query type and the reward realization, and (ii) the more challenging type-only samples revealing only the query type. We propose a novel type-dependent quantile-based meta-policy that decouples the problem into modular components: reward distribution estimation, optimization of target service probabilities via a fluid relaxation, and real-time decisions through dynamic acceptance thresholds. For reward-observed samples, our static threshold policy achieves $\tilde{O}(\sqrt{T})$ regret. For type-only samples, we first establish that sublinear regret is impossible without additional structure; under a mild minimum-arrival-probability assumption, we design both a partially adaptive policy attaining the same $\tilde{O}(\sqrt{T})$ bound and, more significantly, a fully adaptive resolving policy with careful rounding that achieves the first poly-logarithmic regret guarantee of $O((\log T)^3)$ for non-stationary multi-resource allocation. Our framework advances prior work by operating with minimal offline data (one sample per period), handling arbitrary non-stationarity without variation-budget assumptions, and supporting multiple resource constraints.
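To make the fluid-relaxation step concrete, the following is a minimal sketch under assumed notation not defined above: $n$ query types, $m$ resources with budget vector $B \in \mathbb{R}_+^m$, type-$j$ consumption vector $a_j \in \mathbb{R}_+^m$, arrival probability $p_{t,j}$ of type $j$ in period $t$, and $r_j(x)$ denoting the expected reward collected when a type-$j$ query is served with probability $x$ by accepting only rewards above the $(1-x)$-quantile of its reward distribution. The target service probabilities $x_{t,j}$ then solve
\[
\max_{x \in [0,1]^{T \times n}} \; \sum_{t=1}^{T} \sum_{j=1}^{n} p_{t,j}\, r_j(x_{t,j})
\qquad \text{s.t.} \qquad
\sum_{t=1}^{T} \sum_{j=1}^{n} p_{t,j}\, x_{t,j}\, a_j \le B,
\]
with the unknown $p_{t,j}$ and the reward quantiles replaced by estimates formed from the single historical sample available for each period, and the online policy accepting an arriving type-$j$ query whenever its reward clears the corresponding quantile threshold and the remaining budget permits.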