We argue that the selective inclusion of data points based on latent objectives is common in practice, for example in music sequences. Because this selection process often distorts statistical analysis, previous work has primarily viewed it as a bias to be corrected and has proposed various methods to mitigate its effect. However, while controlling this bias is crucial, selection also offers an opportunity for deeper insight into the hidden generation process, since it is a fundamental mechanism underlying what we observe. In particular, overlooking selection in sequential data can lead to an incomplete or overcomplicated inductive bias in modeling, such as assuming a universal autoregressive structure for all dependencies. Therefore, rather than treating selection merely as a bias, we explore its causal structure in sequential data to recover a more complete picture of the causal process. Specifically, we show that the selection structure is identifiable without any parametric assumptions or interventional experiments. Moreover, even when selection variables coexist with latent confounders, we establish nonparametric identifiability under appropriate structural conditions. We also propose a provably correct algorithm to detect and identify selection structures as well as other types of dependencies. The framework is validated empirically on both synthetic data and real-world music.
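To make the claim that selection distorts statistical analysis concrete, the following is a minimal toy sketch and not the algorithm proposed in this work. It assumes two independent standard normal variables and a hypothetical selection rule that keeps a pair only when their sum exceeds a threshold; the names `simulate`, `pearson`, and `threshold` are illustrative assumptions. Under these assumptions, the selected sample shows a clear negative correlation even though the variables are independent in the full population.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=20000, threshold=1.0, seed=0):
    """Draw independent (x, y) pairs, then apply a hypothetical selection rule.

    Returns the correlation before selection, the correlation after
    selection, and the number of pairs that survived selection.
    """
    rng = random.Random(seed)
    pairs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]

    # Assumed selection mechanism: a pair is observed only if x + y is large.
    kept = [(x, y) for x, y in pairs if x + y > threshold]

    r_all = pearson(*zip(*pairs))
    r_kept = pearson(*zip(*kept))
    return r_all, r_kept, len(kept)


if __name__ == "__main__":
    r_all, r_kept, n_kept = simulate()
    print(f"correlation before selection: {r_all:.3f}")
    print(f"correlation after selection:  {r_kept:.3f} ({n_kept} pairs kept)")
```

A model fitted only to the selected pairs might explain this dependence with a direct link between the two variables, which is the kind of overcomplicated inductive bias the abstract warns about.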