We address the challenge of building task-agnostic classifiers using only text descriptions, demonstrating a unified approach to image classification, 3D point-cloud classification, and action recognition from scenes. Unlike approaches that learn a fixed representation of the output classes, we generate at inference time a model tailored to a query classification task. To generate task-based zero-shot classifiers, we train a hypernetwork that receives class descriptions and outputs a multi-class model. The hypernetwork is designed to be equivariant with respect to the set of descriptions and the classification layer, thus obeying the symmetries of the problem and improving generalization. Our approach generates non-linear classifiers, handles rich textual descriptions, and can be adapted to produce lightweight models efficient enough for on-device applications. We evaluate this approach on a series of zero-shot classification tasks, spanning image, point-cloud, and action recognition, using a range of text descriptions: from single words to rich descriptions. Our results demonstrate strong improvements over previous approaches, showing that zero-shot learning can be applied with little training data. Furthermore, we conduct an analysis with foundation vision-and-language models, demonstrating that they struggle to generalize when a description states which attributes a class lacks.
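To make the equivariance property concrete, below is a minimal sketch, not the paper's exact architecture, of a DeepSets-style hypernetwork that maps a set of class-description embeddings to per-class classifier weights. All module names and dimensions are illustrative assumptions; the point is that permuting the input descriptions permutes the generated classifier rows accordingly, which is the symmetry the abstract refers to.

```python
# Sketch of a permutation-equivariant hypernetwork (illustrative, not the
# authors' implementation): each class description embedding is processed
# independently, then mixed with a permutation-invariant summary of the
# whole description set, yielding one classifier weight row per class.
import torch
import torch.nn as nn

class EquivariantHypernet(nn.Module):
    def __init__(self, text_dim: int, feat_dim: int, hidden: int = 256):
        super().__init__()
        # phi acts on each description independently; rho combines it with
        # the set summary, so outputs permute together with inputs.
        self.phi = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.rho = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, feat_dim))

    def forward(self, desc_emb: torch.Tensor) -> torch.Tensor:
        # desc_emb: (K, text_dim), one embedding per class description.
        h = self.phi(desc_emb)                     # (K, hidden)
        ctx = h.mean(dim=0, keepdim=True)          # invariant set summary
        ctx = ctx.expand_as(h)                     # broadcast to every class
        w = self.rho(torch.cat([h, ctx], dim=-1))  # (K, feat_dim)
        return w                                   # row k = weights of class k

# Usage: the generated weights act as a K-way linear head over query features
# (dimensions below are hypothetical).
hyper = EquivariantHypernet(text_dim=512, feat_dim=768)
desc_emb = torch.randn(5, 512)   # embeddings of 5 class descriptions
x = torch.randn(8, 768)          # features of 8 queries (images, point clouds, ...)
logits = x @ hyper(desc_emb).T   # (8, 5) class scores
```

Note that the sketch produces a linear head for brevity; the paper's hypernetwork also generates non-linear classifiers, but the same equivariant set-processing pattern applies.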