This work extends the theory of identifiability in supervised learning by considering the consequences of having access to a distribution of tasks. In such cases, we show that linear identifiability is achievable in the general multi-task regression setting. Furthermore, we show that the existence of a task distribution that defines a conditional prior over latent factors reduces the equivalence class for identifiability to permutations and scaling of the true latent factors, a stronger and more useful result than linear identifiability. Crucially, when we further assume a causal structure over these tasks, our approach enables simple maximum marginal likelihood optimization and suggests downstream applications to causal representation learning. Empirically, we find that this straightforward optimization procedure enables our model to outperform more general unsupervised models in recovering canonical representations for both synthetic data and real-world molecular data.
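To make the equivalence class concrete: identifiability up to permutation and scaling is commonly evaluated by aligning recovered latents to the ground truth with a one-to-one matching and reporting the mean correlation of matched pairs. The sketch below is a minimal, hypothetical illustration (not the paper's method): it simulates a model whose output equals the true latents up to an unknown permutation and per-dimension scaling, then recovers the alignment via Hungarian matching on absolute correlations.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
d, n = 5, 1000

# True latent factors (hypothetical synthetic data).
Z = rng.normal(size=(n, d))

# Simulate a model identifiable only up to an unknown
# permutation and per-dimension scaling of the true factors.
perm = rng.permutation(d)
scale = rng.uniform(0.5, 2.0, size=d)
Z_hat = Z[:, perm] * scale

# Absolute correlation between each true and each recovered dimension.
C = np.abs(np.corrcoef(Z.T, Z_hat.T)[:d, d:])

# Hungarian matching finds the best one-to-one alignment;
# the mean matched correlation is ~1 iff the recovery is a
# permutation-and-scaling of the truth.
row, col = linear_sum_assignment(-C)
mcc = C[row, col].mean()
print(round(mcc, 4))
```

Under this simulation the matched correlation is essentially 1, and the matching itself recovers the inverse of the hidden permutation; a weaker (e.g. merely linear) identifiability result would mix dimensions and drive the matched correlations below 1.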