Battery energy storage systems are increasingly deployed as fast-responding resources for grid balancing services such as frequency regulation and for mitigating renewable generation uncertainty. However, repeated charging and discharging induces cycling degradation and shortens battery lifetime. This paper studies the real-time scheduling of a heterogeneous battery fleet that collectively tracks a stochastic balancing signal subject to per-battery ramp-rate and capacity constraints, while minimizing long-term cycling degradation. Cycling degradation is fundamentally path-dependent: it is determined by the charge-discharge cycles formed by the state-of-charge (SoC) trajectory and is commonly quantified via rainflow cycle counting. This non-Markovian structure makes it difficult to express degradation as an additive per-time-step cost, complicating classical dynamic programming approaches. We address this challenge by formulating the fleet scheduling problem as a Markov decision process (MDP) with a constrained action space and designing a dense proxy reward that provides informative feedback at each time step while remaining aligned with long-term cycle-depth reduction. To scale learning to the large state-action spaces induced by fine-grained SoC discretization and asymmetric per-battery constraints, we develop a function-approximation reinforcement learning method that uses an Extreme Learning Machine (ELM) as a random nonlinear feature map combined with linear temporal-difference learning. We evaluate the proposed approach on a toy Markovian signal model and on a Markovian model fitted to real-world regulation signal traces obtained from the University of Delaware, and demonstrate consistent reductions in cycle-depth occurrences and degradation metrics compared with baseline scheduling policies.
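Rainflow cycle counting, which the abstract identifies as the standard way to quantify path-dependent cycling degradation, can be illustrated with a simplified three-point variant. This is a sketch of the general technique, not the paper's implementation: it omits the ASTM start-point half-cycle rule, and all names are ours.

```python
def rainflow_cycle_depths(soc):
    """Simplified three-point rainflow count over an SoC trajectory.

    Returns a list of (depth, weight) pairs: weight 1.0 for full cycles,
    0.5 for the residual half cycles left on the stack at the end.
    """
    # 1) Compress the trajectory to its turning points (local extrema).
    pts = [soc[0]]
    for x in soc[1:]:
        if len(pts) >= 2 and (pts[-1] - pts[-2]) * (x - pts[-1]) > 0:
            pts[-1] = x           # same direction: extend current excursion
        elif x != pts[-1]:
            pts.append(x)         # direction change: new turning point
    # 2) Stack-based three-point rule: whenever the inner range is enclosed
    #    by the newer one, pop it as a full charge-discharge cycle.
    stack, cycles = [], []
    for p in pts:
        stack.append(p)
        while len(stack) >= 3:
            inner = abs(stack[-2] - stack[-3])
            outer = abs(stack[-1] - stack[-2])
            if inner <= outer:
                cycles.append((inner, 1.0))
                del stack[-3:-1]  # remove the two points of the inner cycle
            else:
                break
    # 3) Whatever remains on the stack forms residual half cycles.
    for a, b in zip(stack, stack[1:]):
        cycles.append((abs(a - b), 0.5))
    return cycles
```

Because the extracted depths depend on the whole SoC trajectory, not on any single time step, a per-step degradation cost cannot be read off directly, which is exactly the non-Markovian difficulty the paper's proxy reward is designed to work around.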
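The learning architecture the abstract describes, an ELM used as a fixed random nonlinear feature map with linear temporal-difference learning on top, can be sketched as follows. This is a minimal illustration under our own naming and hyperparameters (hidden width, activation, step sizes), not the paper's implementation.

```python
import numpy as np

def make_elm_features(state_dim, n_hidden, seed=0):
    # ELM idea: hidden-layer weights are drawn once at random and then
    # frozen; only the linear read-out vector theta is ever learned.
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(state_dim, n_hidden))
    b = rng.normal(size=n_hidden)
    return lambda s: np.tanh(np.asarray(s) @ W + b)

def td0_update(theta, phi_s, phi_next, reward, gamma=0.99, alpha=0.05):
    # Linear TD(0) on the random features: V(s) is approximated by
    # the inner product phi(s) . theta.
    td_error = reward + gamma * phi_next @ theta - phi_s @ theta
    return theta + alpha * td_error * phi_s
```

Keeping the nonlinear layer fixed makes each update a cheap linear operation in the feature space, which is what lets temporal-difference learning scale to the large state-action spaces produced by fine-grained SoC discretization.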