Multi-task learning has emerged as a powerful machine learning paradigm for integrating data from multiple sources, leveraging similarities between tasks to improve overall model performance. However, its application in practice is hindered by data-sharing constraints, especially in healthcare. To address this challenge, we propose a flexible multi-task learning framework that uses only summary statistics from the various sources. Additionally, we present an adaptive parameter selection approach based on a variant of Lepski's method, allowing for data-driven tuning parameter selection when only summary statistics are available. Our systematic non-asymptotic analysis characterizes the performance of the proposed methods under various regimes of the sample complexity and overlap. We illustrate our theoretical findings and the method's performance through extensive simulations. This work offers a more flexible tool for training related models across various domains, with practical implications in genetic risk prediction and many other fields.
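To make the idea of learning from summary statistics concrete, the following is a minimal sketch and not the proposed framework itself. It assumes a linear model in which each source shares only its Gram matrix `X.T @ X`, the cross-product `X.T @ y`, and its sample size, and it further assumes that all sources share a single coefficient vector, which is a simplification of the multi-task setting. The function names `summarize` and `ridge_from_summaries` and the penalty `lam` are hypothetical and chosen for illustration.

```python
import numpy as np

def summarize(X, y):
    """Reduce one source's raw data to summary statistics (illustrative choice)."""
    return X.T @ X, X.T @ y, X.shape[0]

def ridge_from_summaries(summaries, lam=1.0):
    """Pool the summaries and solve the ridge normal equations.

    Assumes every source shares the same coefficient vector; no raw data is used.
    """
    gram = sum(s[0] for s in summaries)   # pooled X^T X
    xty = sum(s[1] for s in summaries)    # pooled X^T y
    p = gram.shape[0]
    return np.linalg.solve(gram + lam * np.eye(p), xty)

# Simulated example with two sources of different sizes.
rng = np.random.default_rng(0)
beta = np.array([1.0, -2.0, 0.5])
sources = []
for n in (50, 80):
    X = rng.normal(size=(n, 3))
    y = X @ beta + 0.1 * rng.normal(size=n)
    sources.append((X, y))

# Each source shares only its summaries; the estimate is computed from those.
beta_hat = ridge_from_summaries([summarize(X, y) for X, y in sources], lam=1.0)
```

Under these assumptions the estimate coincides with ridge regression fitted on the stacked raw data, which is why no individual-level records need to leave a source. The tuning parameter `lam` is fixed by hand here; selecting it from data, as the adaptive procedure described above does, is outside the scope of this sketch.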