Supervised learning typically focuses on learning transferable representations from training examples annotated by humans. While rich annotations (like soft labels) carry more information than sparse annotations (like hard labels), they are also more expensive to collect. For example, hard labels only provide information about the closest class an object belongs to (e.g., "this is a dog"), whereas soft labels provide information about the object's relationship with multiple classes (e.g., "this is most likely a dog, but it could also be a wolf or a coyote"). We use information theory to compare how several commonly used supervision signals contribute to representation-learning performance, and how their capacity is affected by factors such as the number of labels, classes, dimensions, and noise. Our framework provides theoretical justification for using hard labels in the big-data regime, and for using richer supervision signals in few-shot learning and out-of-distribution generalization. We validate these results empirically in a series of experiments with over 1 million crowdsourced image annotations, and we conduct a cost-benefit analysis to establish a tradeoff curve that enables users to optimize the cost of supervising representation learning on their own datasets.
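To make the hard-versus-soft distinction concrete, the following is a minimal sketch and not the paper's framework. The class names, the probabilities, and the helpers `entropy_bits` and `to_hard` are illustrative assumptions. It shows only that collapsing a soft label to its most likely class discards the annotator's uncertainty over the other classes; the Shannon entropy of the label vector is used here as a crude proxy, not as the capacity measure studied in the paper.

```python
import math

# Hypothetical class set taken from the abstract's example.
CLASSES = ["dog", "wolf", "coyote"]


def entropy_bits(dist):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def to_hard(dist):
    """Collapse a soft label to a one-hot vector on its most likely class."""
    top = max(range(len(dist)), key=lambda i: dist[i])
    return [1.0 if i == top else 0.0 for i in range(len(dist))]


# Assumed soft annotation: "most likely a dog, but could be a wolf or a coyote".
soft_label = [0.7, 0.2, 0.1]
hard_label = to_hard(soft_label)

print("hard:", dict(zip(CLASSES, hard_label)))
print("soft:", dict(zip(CLASSES, soft_label)))
print("entropy of hard label (bits):", entropy_bits(hard_label))
print("entropy of soft label (bits):", round(entropy_bits(soft_label), 4))
```

The one-hot vector has zero entropy because it encodes a single class with certainty, while the soft label retains a graded relationship to all three classes, which is the extra signal that costs more to collect.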