Metalearning with Very Few Samples Per Task

Metalearning and multitask learning are two frameworks for solving a group of related learning tasks more efficiently than we could hope to solve each of the individual tasks on their own. In multitask learning, we are given a fixed set of related learning tasks and need to output one accurate model per task, whereas in metalearning we are given tasks that are drawn i.i.d. from a metadistribution and need to output some common information that can be easily specialized to new tasks from the metadistribution. We consider a binary classification setting where tasks are related by a shared representation, that is, every task $P$ can be solved by a classifier of the form $f_{P} \circ h$ where $h \in H$ is a map from features to a representation space that is shared across tasks, and $f_{P} \in F$ is a task-specific classifier from the representation space to labels. The main question we ask is how much data do we need to metalearn a good representation? Here, the amount of data is measured in terms of the number of tasks $t$ that we need to see and the number of samples $n$ per task. We focus on the regime where $n$ is extremely small. Our main result shows that, in a distribution-free setting where the feature vectors are in $\mathbb{R}^d$, the representation is a linear map from $\mathbb{R}^d \to \mathbb{R}^k$, and the task-specific classifiers are halfspaces in $\mathbb{R}^k$, we can metalearn a representation with error $\varepsilon$ using $n = k+2$ samples per task, and $d \cdot (1/\varepsilon)^{O(k)}$ tasks. Learning with so few samples per task is remarkable because metalearning would be impossible with $k+1$ samples per task, and because we cannot even hope to learn an accurate task-specific classifier with $k+2$ samples per task. Our work also yields a characterization of distribution-free multitask learning and reductions between meta and multitask learning.
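To make the shared-representation setting concrete, the following is a minimal sketch of the data model only, not of the metalearning algorithm. It assumes a realizable setting with Gaussian features and Gaussian task parameters, which are illustrative choices; the result above is distribution-free. All function names (`make_representation`, `make_task`, `classify`, `sample_task_data`) and the values of `d`, `k`, and `t` are hypothetical.

```python
import random

def make_representation(d, k, rng):
    """Hypothetical shared linear map h: R^d -> R^k, stored as a k x d matrix."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

def represent(h, x):
    """Apply the linear representation h to a feature vector x in R^d."""
    return [sum(h_ij * x_j for h_ij, x_j in zip(row, x)) for row in h]

def make_task(k, rng):
    """Hypothetical task-specific halfspace f_P in R^k, given by weights and a bias."""
    w = [rng.gauss(0.0, 1.0) for _ in range(k)]
    b = rng.gauss(0.0, 1.0)
    return w, b

def classify(h, task, x):
    """Evaluate (f_P o h)(x) and return a label in {-1, +1}."""
    w, b = task
    z = represent(h, x)
    score = sum(w_i * z_i for w_i, z_i in zip(w, z)) + b
    return 1 if score >= 0 else -1

def sample_task_data(h, d, k, rng):
    """Draw one task and n = k + 2 labeled samples from it.

    Gaussian features are an assumption made for illustration only.
    """
    task = make_task(k, rng)
    n = k + 2
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    ys = [classify(h, task, x) for x in xs]
    return xs, ys

rng = random.Random(0)
d, k, t = 10, 2, 5  # feature dimension, representation dimension, number of tasks
h = make_representation(d, k, rng)
dataset = [sample_task_data(h, d, k, rng) for _ in range(t)]
```

Each entry of `dataset` holds the $k+2$ labeled samples of one task. A metalearner in this setting would see only `dataset` and would need to recover a representation close to `h` without access to the per-task halfspaces.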
