Conformal Off-Policy Prediction for Multi-Agent Systems

Off-Policy Prediction (OPP), i.e., predicting the outcomes of a target policy using only data collected under a nominal (behavioural) policy, is a key problem in the data-driven analysis of safety-critical systems, where deploying a new policy may be unsafe. To achieve dependable off-policy predictions, recent work on Conformal Off-Policy Prediction (COPP) leverages the conformal prediction framework to derive prediction regions with probabilistic guarantees under the target process. Existing COPP methods can account for the distribution shifts induced by policy switching, but they are limited to single-agent systems and scalar outcomes (e.g., rewards). In this work, we introduce MA-COPP, the first conformal prediction method to solve OPP problems involving multi-agent systems. MA-COPP derives joint prediction regions (JPRs) for all agents' trajectories when one or more "ego" agents change their policies. This setting is more complex than the single-agent one: the distribution shifts affect predictions for all agents, not just the ego agents, and the prediction task involves full multi-dimensional trajectories, not just reward values. A key contribution of MA-COPP is that it avoids enumerating or exhaustively searching the output space of agent trajectories, which existing COPP methods require in order to construct the prediction region. We achieve this by showing that an over-approximation of the true JPR can be constructed, without enumeration, from the maximum density ratio of the JPR trajectories. We evaluate MA-COPP on multi-agent systems from the PettingZoo library and on the F1TENTH autonomous racing environment, where it achieves nominal coverage in higher dimensions and under various shift settings.
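
To illustrate why a maximum density ratio can stand in for enumeration, the following is a minimal sketch of a weighted conformal threshold, not the MA-COPP algorithm itself. It assumes that nonconformity scores and density ratios (target over behavioural trajectory likelihood) are already available for a calibration set; the function name `weighted_conformal_threshold` and all variable names are hypothetical. In weighted conformal prediction the threshold depends on the density ratio of the unknown test trajectory, which is why region construction would normally require checking candidate outputs one by one. Replacing that ratio with an upper bound can only raise the threshold, so the resulting region contains the exact one.

```python
import numpy as np

def weighted_conformal_threshold(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration scores.

    scores:      nonconformity scores of calibration trajectories (assumed given)
    weights:     density ratios of the calibration trajectories (assumed given)
    test_weight: density ratio assigned to the test trajectory; passing an
                 upper bound here yields a conservative (larger) threshold
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    scores, weights = scores[order], weights[order]
    # The test point contributes a point mass at +infinity.
    total = weights.sum() + test_weight
    cum = np.cumsum(weights) / total
    # First sorted score whose cumulative normalized weight reaches 1 - alpha.
    idx = np.searchsorted(cum, 1.0 - alpha, side="left")
    return scores[idx] if idx < len(scores) else np.inf

# Hypothetical calibration data: ten scores with unit density ratios.
cal_scores = np.arange(1.0, 11.0)
cal_weights = np.ones(10)

# Threshold for one particular test ratio versus an assumed upper bound on it.
exact = weighted_conformal_threshold(cal_scores, cal_weights, test_weight=1.0, alpha=0.3)
bound = weighted_conformal_threshold(cal_scores, cal_weights, test_weight=3.0, alpha=0.3)
```

Under these assumptions, a trajectory belongs to the region when its score does not exceed the threshold. Because `bound` is never smaller than `exact`, the region built from the upper bound over-approximates the exact one without evaluating the density ratio of each candidate trajectory.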
