To leverage abundant data from source tasks and overcome the scarcity of target task samples, representation learning based on multi-task pretraining has become a standard approach in many applications. Until now, however, most existing works have designed their source task selection strategies from a purely empirical perspective. Recently, \citet{chen2022active} gave the first active multi-task representation learning (A-MTRL) algorithm, which adaptively samples from source tasks and provably reduces the total sample complexity using the L2-regularized target-source relevance parameter $\nu^2$. Their approach, however, is theoretically suboptimal in total source sample complexity and is less practical in real-world scenarios where a sparse selection of training source tasks is desired. In this paper, we address both issues. Specifically, we show the strict dominance of the L1-regularized-relevance-based ($\nu^1$-based) strategy by giving a lower bound for the $\nu^2$-based strategy. When $\nu^1$ is unknown, we propose a practical algorithm that uses the LASSO program to estimate $\nu^1$; this algorithm recovers the optimal result of the known-$\nu^1$ case. Beyond these sample complexity results, we also characterize the potential of our $\nu^1$-based strategy in sample-cost-sensitive settings. Finally, we provide experiments on real-world computer vision datasets to illustrate the effectiveness of the proposed method.
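As a rough illustration of how a LASSO program can yield a sparse relevance estimate, the following minimal sketch solves $\min_{\nu} \tfrac{1}{2}\|W\nu - w\|_2^2 + \lambda\|\nu\|_1$ by coordinate descent. It is not the algorithm of this paper. The matrix `W` (one column per source task), the target vector `w`, the penalty `lam`, and the rule that splits a sampling budget in proportion to $|\nu_m|$ are all assumptions made for the example. In particular, the sketch treats the task vectors as given, whereas in practice they would have to be estimated from data.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t (the proximal operator of the L1 norm)."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_relevance(W, w, lam, n_iters=200):
    """Coordinate descent for 0.5 * ||W nu - w||^2 + lam * ||nu||_1.

    W is a list of k rows, each of length M (hypothetical layout: column m
    holds the linear predictor of source task m); w is the length-k target
    predictor. Returns the length-M estimate of the relevance vector.
    """
    k, M = len(W), len(W[0])
    nu = [0.0] * M
    residual = list(w)  # equals w - W nu, kept up to date incrementally
    col_sq = [sum(W[i][j] ** 2 for i in range(k)) for j in range(M)]
    for _ in range(n_iters):
        for j in range(M):
            if col_sq[j] == 0.0:
                continue  # an all-zero column can never help the fit
            # Correlation of column j with the residual that excludes
            # column j's own current contribution.
            rho = sum(W[i][j] * residual[i] for i in range(k)) + col_sq[j] * nu[j]
            new_value = soft_threshold(rho, lam) / col_sq[j]
            delta = new_value - nu[j]
            if delta != 0.0:
                for i in range(k):
                    residual[i] -= W[i][j] * delta
                nu[j] = new_value
    return nu


def allocate_budget(nu, total):
    """Split `total` samples across tasks in proportion to |nu_m|.

    This proportional rule is an illustrative assumption; rounding down
    keeps the allocation within the budget.
    """
    mass = sum(abs(v) for v in nu)
    if mass == 0.0:
        return [0] * len(nu)
    return [int(total * abs(v) / mass) for v in nu]


if __name__ == "__main__":
    # Toy data: source task 0 matches the target exactly; tasks 1 and 2 do not.
    W = [[1.0, 0.0, 0.5],
         [0.0, 1.0, 0.5]]
    w = [1.0, 0.0]
    nu_hat = lasso_relevance(W, w, lam=0.1)
    print("estimated relevance:", nu_hat)
    print("sample allocation:", allocate_budget(nu_hat, total=1000))
```

On the toy data, the L1 penalty sets the coefficients of the two unhelpful tasks to exactly zero, so the whole budget goes to the single relevant task. This sparsity is what makes an L1-based selection attractive when each additional source task carries a cost.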