Data are often collected from multiple sites (e.g., hospitals) that share common structure but also exhibit heterogeneity. This paper aims to learn robust sequential decision-making policies from such offline, multi-site datasets. To model cross-site uncertainty, we study distributionally robust MDPs with a group-linear structure: all sites share a common feature map, and both the transition kernels and the expected reward functions are linear in these shared features. We introduce feature-wise (d-rectangular) uncertainty sets, which keep the robust Bellman recursion tractable while preserving key cross-site structure. Building on this structure, we develop an offline algorithm based on pessimistic value iteration that comprises: (i) per-site ridge regression for Bellman targets, (ii) feature-wise worst-case aggregation (row-wise minimization across sites), and (iii) a data-dependent pessimism penalty computed from the diagonals of the inverse design matrices. We further propose a cluster-level extension that pools similar sites to improve sample efficiency, guided by prior knowledge of site similarity. Under a robust partial coverage assumption, we prove a suboptimality bound for the resulting policy. Overall, our framework addresses multi-site learning from heterogeneous data sources and provides a principled approach to robust planning without relying on strong state-action rectangularity assumptions.
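The three algorithmic components named in the abstract can be illustrated with a minimal NumPy sketch of a single backup step. This is an illustration under stated assumptions, not the paper's algorithm: the function names (`ridge_weights`, `pessimistic_robust_q`), the penalty scale `beta`, the ridge parameter `lam`, and the choice to subtract each site's penalty before taking the minimum are all hypothetical. It also assumes non-negative features, so that a coordinate-wise minimum over sites gives a lower bound on the value.

```python
import numpy as np


def ridge_weights(Phi, targets, lam=1.0):
    """Per-site ridge regression of Bellman targets on shared features.

    Phi: (n, d) feature matrix for one site; targets: (n,) regression targets.
    Returns the weight vector (d,) and the inverse design matrix (d, d).
    """
    d = Phi.shape[1]
    design = Phi.T @ Phi + lam * np.eye(d)
    design_inv = np.linalg.inv(design)
    weights = design_inv @ (Phi.T @ targets)
    return weights, design_inv


def pessimistic_robust_q(phi_query, site_data, beta=1.0, lam=1.0):
    """One pessimistic, feature-wise robust Q estimate (hypothetical sketch).

    site_data: list of (Phi_k, targets_k) pairs, one per site.
    phi_query: (d,) non-negative feature vector of the queried state-action.
    """
    weights, penalties = [], []
    for Phi, targets in site_data:
        w, design_inv = ridge_weights(Phi, targets, lam)
        weights.append(w)
        # Penalty per feature coordinate from the diagonal of the inverse
        # design matrix (assumed form: square root of each diagonal entry).
        penalties.append(np.sqrt(np.diag(design_inv)))
    W = np.stack(weights)      # shape (num_sites, d)
    P = np.stack(penalties)    # shape (num_sites, d)
    # Feature-wise worst case: minimize each coordinate over sites
    # after subtracting the scaled penalty (an assumed combination rule).
    w_robust = np.min(W - beta * P, axis=0)
    return phi_query @ w_robust


# Two toy sites with identity features and different targets.
site_a = (np.eye(2), np.array([2.0, 4.0]))
site_b = (np.eye(2), np.array([4.0, 2.0]))
phi = np.array([0.5, 0.5])
q_no_penalty = pessimistic_robust_q(phi, [site_a, site_b], beta=0.0)
q_penalized = pessimistic_robust_q(phi, [site_a, site_b], beta=1.0)
```

In this toy setup each site is the worst case for a different feature coordinate, so the robust weight vector mixes coordinates from both sites, and a positive `beta` lowers the estimate further. A full value-iteration loop would repeat this backup over the horizon and over all state-action pairs, which the sketch omits.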