In this paper, we consider the problem of learning online to manage Demand Response (DR) resources. A typical DR mechanism requires the DR manager to assign a baseline to each participating consumer, where the baseline is an estimate of the consumer's counterfactual consumption, that is, what it would have consumed had it not been called to provide the DR service. A challenge in estimating the baseline is that the consumer has an incentive to inflate the estimate. We consider the problem of learning online to estimate the baseline and to optimize the operating costs over a period of time under such incentives. We propose an online learning scheme that uses least-squares estimation together with a perturbation of the reward price (paid for the DR service, i.e., load curtailment), designed to balance the exploration-exploitation trade-off that arises in online learning. We show that the proposed scheme achieves a regret of $\mathcal{O}\left((\log{T})^2\right)$ over $T$ days of the DR program, measured against the optimal operating cost attainable with full knowledge of the baseline, and that it is individually rational for consumers to participate. This is significantly better than the averaging-type approach, which achieves only $\mathcal{O}(T^{1/3})$ regret.
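The interplay between least-squares estimation and price perturbation described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the linear response of load to the reward price, the target-load pricing rule, the $1/\sqrt{t}$ perturbation schedule, and the names `ols_line` and `run_dr_program` are all hypothetical. The sketch also does not model the consumer's strategic incentive to inflate the baseline; it only shows why perturbing the price keeps the least-squares fit identifiable.

```python
import random


def ols_line(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        # Without price variation the two parameters cannot be separated.
        raise ValueError("prices must vary for the fit to be identifiable")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def run_dr_program(days, baseline, sensitivity, target_load,
                   noise_std=0.0, seed=0):
    """Toy loop: perturb the price, observe the load, refit by least squares.

    Assumed model (hypothetical): load = baseline - sensitivity * price + noise.
    Returns the final estimates (est_baseline, est_sensitivity).
    """
    rng = random.Random(seed)
    prices, loads = [], []
    base_price = 1.0  # hypothetical initial price guess
    est_baseline, est_sens = None, None
    for t in range(1, days + 1):
        # Exploration: random perturbation that decays with t (assumed schedule).
        price = max(0.0, base_price + rng.uniform(-0.5, 0.5) / t ** 0.5)
        # Observe the consumer's load under the assumed linear response.
        load = baseline - sensitivity * price + rng.gauss(0.0, noise_std)
        prices.append(price)
        loads.append(load)
        if t >= 2:
            est_baseline, slope = ols_line(prices, loads)
            est_sens = -slope
            if est_sens > 1e-6:
                # Exploitation: price that would hit the target under the fit.
                base_price = max(0.0, (est_baseline - target_load) / est_sens)
    return est_baseline, est_sens
```

For example, `run_dr_program(50, baseline=10.0, sensitivity=2.0, target_load=8.0)` runs the toy loop without noise, in which case the fit recovers the assumed baseline and sensitivity up to floating-point error. Setting `noise_std` above zero shows how the estimates tighten as more perturbed prices are observed.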