The key challenge underlying machine learning is generalisation to new data. This work studies generalisation for datasets consisting of related tasks that may differ in their causal mechanisms. For example, observational medical data for complex diseases are heterogeneous in the causal mechanisms of disease across patients, which creates challenges for machine learning algorithms that must generalise to new patients outside the training dataset. Common approaches to learning supervised models from heterogeneous datasets include learning a global model for the entire dataset, learning local models for each task's data, or using hierarchical, meta-learning and multi-task learning approaches to learn how to generalise from data pooled across multiple tasks. In this paper we propose causal similarity-based hierarchical Bayesian models to improve generalisation to new tasks by learning how to pool data from training tasks with similar causal mechanisms. We apply this general modelling principle to Bayesian neural networks and compare a variety of methods for estimating causal task similarity (for both known and unknown causal models). We demonstrate the benefits of our approach and its applicability to real-world problems through a range of experiments on simulated and real data.