We aim to generalise to new data in a non-iid setting in which datasets from related tasks are observed, each generated by a different causal mechanism, and the test dataset comes from the same task distribution. The setup is motivated by personalised medicine, where each patient is a task and complex diseases are heterogeneous across patients in cause and progression. The difficulty is that a single task usually contains too little data to identify its causal mechanism, and unless the mechanisms are the same, pooling data across tasks, which meta-learning does in one way or another, may yield worse predictors when the test setting differs in uncontrolled ways. We formulate meta-learning as Bayesian hierarchical modelling and introduce into it a proxy measure of the similarity of the tasks' causal mechanisms, obtained by learning a suitable embedding of the tasks from the whole dataset. This embedding serves as auxiliary data for assessing which tasks should be pooled in the hierarchical model. We show that such pooling improves predictions in three health-related case studies, and sensitivity analyses on simulated data show that the method aids generalisability by using interventional data to identify tasks with similar causal mechanisms for pooling, even in limited-data settings.