Digital twin modeling, including control and data assimilation under model uncertainty, often faces an open-ended fidelity problem: adding variables, data streams, and time scales can indefinitely increase model complexity, ultimately producing systems that are difficult to maintain, validate, interpret, and use for stress or safety testing. As an alternative, one can seek parsimonious stochastic surrogate models built only on the variables needed to describe the relevant quantities of interest. We introduce a framework for discovering such variables from observational data by identifying which candidate inputs influence the full conditional law of a target quantity, rather than only its conditional mean. This distinction is essential in stochastic, coarse-grained, or partially observed systems, where dependencies may appear through changes in variability, tail behavior, multimodality, or uncertainty rather than through deterministic functional relationships. The framework couples conditional generative modeling, which learns the conditional distribution of the target given candidate inputs, with Gaussian-process-based analysis of variance (through kernel mode decomposition), which enables iterative pruning of non-influential inputs and interpretable structure discovery. In control settings, the resulting surrogate can be interpreted as a learned Markov decision process: the method identifies not only a transition model, but also the state, action, and memory variables needed to make the learned dynamics effectively Markovian. Across examples involving stochastic dynamical systems, missing variables, PDE control, reinforcement learning, and economic data, the discovered structures yield interpretable stochastic surrogates whose downstream performance is comparable to models trained on the full variable set.
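The distinction between influencing the conditional mean and influencing the full conditional law can be made concrete with a toy sketch. The example below is not the framework described above: it replaces the conditional generative model and the Gaussian-process-based analysis of variance with a crude two-sample Kolmogorov-Smirnov comparison, and every name in it (`simulate`, `ks_statistic`, `influence_scores`), along with the data-generating process itself, is a hypothetical choice made for illustration. It assumes a target `y` whose noise scale depends on `x1` while `x2` plays no role at all.

```python
import random
import statistics


def simulate(n=4000, seed=0):
    """Toy data (assumed, not from the paper): x1 scales the noise of y, x2 is irrelevant."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1.0, 1.0)
        x2 = rng.uniform(-1.0, 1.0)
        # E[y | x1, x2] = 0 for every input, but the spread of y grows with |x1|.
        y = (0.2 + 2.0 * abs(x1)) * rng.gauss(0.0, 1.0)
        rows.append((x1, x2, y))
    return rows


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        # Advance whichever sample holds the smaller next value.
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def influence_scores(rows, col, threshold=0.5):
    """Compare y between samples with small and large |input| in column `col`.

    Returns (mean_gap, law_gap): the difference in conditional means and the
    KS distance between the two conditional samples of y.
    """
    low = [r[2] for r in rows if abs(r[col]) < threshold]
    high = [r[2] for r in rows if abs(r[col]) >= threshold]
    mean_gap = abs(statistics.fmean(low) - statistics.fmean(high))
    law_gap = ks_statistic(low, high)
    return mean_gap, law_gap


if __name__ == "__main__":
    data = simulate()
    for name, col in (("x1", 0), ("x2", 1)):
        mean_gap, law_gap = influence_scores(data, col)
        print(f"{name}: mean gap = {mean_gap:.3f}, law gap = {law_gap:.3f}")
```

In this sketch a mean-based criterion would discard both inputs, since neither shifts the conditional mean of `y`, whereas the distributional comparison separates `x1` from `x2`. A pruning procedure in the spirit of the abstract would keep `x1` and drop `x2`, although the actual method makes that decision through a learned conditional distribution rather than a fixed two-bin split.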