We study the problem of transfer learning in the setting of stochastic linear bandit tasks. We assume that a low-dimensional linear representation is shared across the tasks, and study the benefit of learning this representation in the multi-task learning setting. Building on recent results on the design of stochastic bandit policies, we propose an efficient greedy policy based on trace norm regularization. It implicitly learns a low-dimensional representation by encouraging the matrix formed by the task regression vectors to be of low rank. Unlike previous work in the literature, our policy does not need to know the rank of the underlying matrix. We derive an upper bound on the multi-task regret of our policy, which is, up to logarithmic factors, of order $\sqrt{NdT(T+d)r}$, where $T$ is the number of tasks, $r$ the rank, $d$ the number of variables and $N$ the number of rounds per task. We show the benefit of our strategy compared to the baseline $Td\sqrt{N}$ obtained by solving each task independently. We also provide a lower bound on the multi-task regret. Finally, we corroborate our theoretical findings with preliminary experiments on synthetic data.
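To make the idea of trace norm regularization concrete, the following is a minimal sketch of a trace-norm-regularized least-squares estimator for the matrix of task regression vectors, solved by proximal gradient descent with singular value soft-thresholding. It is not the policy from the paper: the function names `svd_soft_threshold` and `trace_norm_least_squares`, the scaling of the objective, the step size, and the iteration count are all assumptions made for illustration. It only shows how penalizing the sum of singular values can encourage a low-rank estimate without the rank being supplied.

```python
import numpy as np


def svd_soft_threshold(W, tau):
    """Proximal operator of tau * ||W||_* (trace norm): shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt


def trace_norm_least_squares(X, y, lam, n_iter=500):
    """Proximal gradient descent for the (assumed, illustrative) objective

        1 / (2 * T * N) * sum_t ||X[t] @ W[:, t] - y[t]||^2 + lam * ||W||_*

    X has shape (T, N, d): T tasks, N rounds per task, d variables.
    y has shape (T, N). Returns W of shape (d, T), one column per task.
    """
    T, N, d = X.shape
    # Lipschitz constant of the gradient of the smooth part: the loss is
    # separable across tasks, so take the largest per-task curvature.
    L = max(np.linalg.norm(X[t], 2) ** 2 for t in range(T)) / (T * N)
    W = np.zeros((d, T))
    for _ in range(n_iter):
        residual = np.einsum("tnd,dt->tn", X, W) - y        # shape (T, N)
        grad = np.einsum("tnd,tn->dt", X, residual) / (T * N)  # shape (d, T)
        # Gradient step on the squared loss, then prox step on the trace norm.
        W = svd_soft_threshold(W - grad / L, lam / L)
    return W
```

In this sketch the rank $r$ never appears as an input: the regularization weight `lam` controls how many singular values survive the thresholding step. With `lam=0.0` the routine reduces to plain gradient descent on the per-task least-squares problems, and a sufficiently large `lam` drives the estimate to the zero matrix.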