Given only observational data $X = g(Z)$, where both the latent variables $Z$ and the generating process $g$ are unknown, recovering $Z$ is ill-posed without additional assumptions. Existing methods often assume linearity or rely on auxiliary supervision and functional constraints. However, such assumptions are rarely verifiable in practice, and most theoretical guarantees break down under even mild violations, leaving it unclear how to reliably understand the hidden world. To make identifiability actionable in real-world scenarios, we take a complementary view: in general settings where full identifiability is unattainable, what can still be recovered with guarantees, and what biases could be universally adopted? We introduce the problem of diverse dictionary learning to formalize this view. Specifically, we show that intersections, complements, and symmetric differences of the latent variables linked to arbitrary observations, along with the latent-to-observed dependency structure, remain identifiable up to appropriate indeterminacies even without strong assumptions. These set-theoretic results can be composed using set algebra to construct structured and essential views of the hidden world, such as genus-differentia definitions. When sufficient structural diversity is present, they further imply full identifiability of all latent variables. Notably, all identifiability benefits follow from a simple inductive bias during estimation that can be readily integrated into most models. We validate the theory and demonstrate the benefits of the bias on both synthetic and real-world data.
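As an illustration of the set algebra mentioned above, the following minimal sketch treats a latent-to-observed dependency structure as a mapping from each observed variable to the set of latent variables it depends on. The variable names (`x1`, `z1`, ...), the `dependencies` mapping, and the helper functions are hypothetical and chosen only for this example; "complement" is read here as a relative complement (set difference), which is an assumption. The sketch shows only how such sets compose, not the estimation procedure or its identifiability guarantees.

```python
# Hypothetical dependency structure (illustrative only, not from the paper):
# each observed variable maps to the set of latent variables it depends on.
dependencies = {
    "x1": {"z1", "z2", "z3"},
    "x2": {"z2", "z3", "z4"},
    "x3": {"z3", "z5"},
}


def shared(a, b):
    """Intersection: latents that both observations depend on."""
    return dependencies[a] & dependencies[b]


def only_in(a, b):
    """Relative complement: latents linked to `a` but not to `b`."""
    return dependencies[a] - dependencies[b]


def differing(a, b):
    """Symmetric difference: latents linked to exactly one of `a`, `b`."""
    return dependencies[a] ^ dependencies[b]


# A genus-differentia style view of x1 relative to x2:
# the genus is what they share, the differentia is what sets x1 apart.
genus = shared("x1", "x2")
differentia_x1 = only_in("x1", "x2")

# Set operations compose: latents shared by x1 and x2 but not linked to x3.
shared_not_x3 = shared("x1", "x2") - dependencies["x3"]

print("genus:", sorted(genus))
print("differentia of x1:", sorted(differentia_x1))
print("symmetric difference:", sorted(differing("x1", "x2")))
print("shared by x1, x2 but not x3:", sorted(shared_not_x3))
```

Representing each observation's latent support as a plain Python `set` keeps the composition explicit: any expression built from `&`, `-`, and `^` over these sets corresponds to one composed view of the latent variables.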