Dense prediction tasks are a fundamental class of problems in computer vision. Because supervised methods incur a high pixel-wise labeling cost, a few-shot learning solution that can learn any dense task from a few labeled images is desirable. Yet current few-shot learning methods target a restricted set of tasks such as semantic segmentation, presumably because it is difficult to design a general, unified model that adapts flexibly and efficiently to arbitrary tasks with unseen semantics. We propose Visual Token Matching (VTM), a universal few-shot learner for arbitrary dense prediction tasks. It employs non-parametric matching on patch-level embedded tokens of images and labels, a formulation that encapsulates all tasks. In addition, VTM adapts flexibly to any task through a small number of task-specific parameters that modulate the matching algorithm. We implement VTM as a hierarchical encoder-decoder architecture with ViT backbones, in which token matching is performed at multiple feature hierarchies. We evaluate VTM on a challenging variant of the Taskonomy dataset and observe that it robustly few-shot learns various unseen dense prediction tasks. Surprisingly, it is competitive with fully supervised baselines using only 10 labeled examples of novel tasks (0.004% of full supervision), and it sometimes outperforms them using 0.1% of full supervision. Code is available at https://github.com/GitGyun/visual_token_matching.
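To make the idea of non-parametric matching on patch-level tokens concrete, the following is a minimal sketch and not the authors' implementation. It assumes that matching can be approximated by a softmax-weighted average of support label tokens, with weights given by scaled dot-product similarity between query and support image tokens. The function name `token_matching`, the argument names, and the square-root scaling are illustrative assumptions; the encoders, the decoder, the task-specific parameters, and the multi-level hierarchy described above are all omitted.

```python
import numpy as np


def token_matching(query_img_tokens, support_img_tokens, support_label_tokens):
    """Predict label tokens for query patches by matching against support patches.

    Hypothetical helper for illustration only (not taken from the VTM code base).
    query_img_tokens:     (M, d) embedded patches of the query image
    support_img_tokens:   (N, d) embedded patches of the labeled support images
    support_label_tokens: (N, c) embedded patches of the corresponding labels
    returns:              (M, c) predicted label tokens for the query patches
    """
    d = query_img_tokens.shape[-1]
    # Similarity between every query patch and every support patch.
    # Assumption: scaled dot product, as in standard attention.
    scores = query_img_tokens @ support_img_tokens.T / np.sqrt(d)
    # Numerically stable softmax over the support patches.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each query patch receives a similarity-weighted average of label tokens.
    return weights @ support_label_tokens


# Toy usage with random tokens: 5 query patches, 8 support patches.
rng = np.random.default_rng(0)
query = rng.normal(size=(5, 16))
support_images = rng.normal(size=(8, 16))
support_labels = rng.normal(size=(8, 4))
predicted = token_matching(query, support_images, support_labels)
print(predicted.shape)
```

The matching step itself has no learned weights in this sketch: everything task-specific would live in how the tokens are embedded, which is consistent with the description of a small set of task-specific parameters modulating the matching.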