We address the synthesis of control policies for unknown discrete-time stochastic dynamical systems to satisfy temporal logic objectives. We present a data-driven, abstraction-based control framework that integrates online learning with a novel incremental game-solving algorithm. Under appropriate continuity assumptions, our method abstracts the system dynamics into a finite stochastic (2.5-player) game graph derived from data. Given a temporal logic objective over this graph, we compute the winning region (i.e., the set of initial states from which the objective can be satisfied) in the resulting game, together with a corresponding control policy. Our main contribution is to construct the abstraction, winning region, and control policy incrementally, as data about the system dynamics accumulates. Concretely, our algorithm refines under- and over-approximations of the reachable sets for each state-action pair as new data samples arrive. These refinements induce structural modifications in the game graph abstraction, such as the addition or removal of nodes and edges, which in turn modify the winning region. Crucially, we show that these updates are inherently monotonic: under-approximations can only grow, over-approximations can only shrink, and the winning region can only expand. We exploit this monotonicity by defining an objective-induced ranking function on the nodes of the abstract game that increases monotonically as new data samples are incorporated. These ranks underpin our incremental game-solving algorithm, which employs customized gadgets (DAG-like subgames) within a rank-lifting procedure to efficiently update the winning region. Numerical case studies demonstrate significant computational savings over the baseline approach, which re-solves the entire game from scratch whenever new data samples arrive.
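To make the monotonicity argument concrete, the following is a minimal Python sketch, not the paper's implementation: all names (`GameAbstraction`, `refine`, `winning_region`) are illustrative assumptions, the probabilistic (2.5-player) and general temporal-logic aspects are deliberately omitted in favor of plain reachability, and the global fixpoint below stands in for the paper's localized rank-lifting with gadgets.

```python
# Hypothetical sketch of the monotone bookkeeping described in the abstract.
# Simplifications: sure reachability instead of a full temporal logic
# objective, adversarial (not stochastic) resolution of uncertainty, and a
# global fixpoint instead of incremental rank lifting.
from collections import defaultdict


class GameAbstraction:
    def __init__(self, states, actions):
        self.states = set(states)
        self.actions = set(actions)
        # Maximal initial uncertainty: no transition confirmed yet,
        # every successor still considered possible.
        self.under = defaultdict(set)  # confirmed successors; only grows
        self.over = {(q, a): set(self.states)
                     for q in self.states for a in self.actions}  # only shrinks

    def refine(self, state, action, observed, excluded):
        """Fold in a new data sample: successors seen in the data join the
        under-approximation; successors ruled out (e.g., via a continuity
        bound) leave the over-approximation."""
        self.under[(state, action)] |= set(observed)
        self.over[(state, action)] -= set(excluded)


def winning_region(game, targets):
    """Monotone fixpoint for a reachability objective: a state wins if some
    action is confirmed available (nonempty under-approximation) and every
    still-possible successor (over-approximation) already wins. Shrinking
    over-approximations only weaken this test, so the region computed here
    can only expand across refinements."""
    win = set(targets)
    changed = True
    while changed:
        changed = False
        for q in game.states - win:
            for a in game.actions:
                if game.under[(q, a)] and game.over[(q, a)] <= win:
                    win.add(q)
                    changed = True
                    break
    return win


# Usage: three abstract states, target {2}. Each refinement can only
# enlarge the winning region, never shrink it.
g = GameAbstraction(states={0, 1, 2}, actions={"a"})
print(winning_region(g, targets={2}))           # {2}: nothing confirmed yet
g.refine(1, "a", observed={2}, excluded={0, 1})
g.refine(0, "a", observed={1}, excluded={0, 2})
print(winning_region(g, targets={2}))           # {0, 1, 2}: region expanded
```

The sketch mirrors the abstract's key invariant: `refine` only ever grows `under` and shrinks `over`, so the acceptance test (`under` nonempty and `over` contained in the winning set) can only flip from failing to passing, never back. An incremental solver exploits exactly this one-way behavior by lifting ranks locally instead of recomputing the fixpoint from scratch.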


