Supervised learning typically focuses on learning transferable representations from training examples annotated by humans. While rich annotations (like soft labels) carry more information than sparse annotations (like hard labels), they are also more expensive to collect. For example, hard labels only provide information about the closest class an object belongs to (e.g., "this is a dog"), whereas soft labels provide information about the object's relationship with multiple classes (e.g., "this is most likely a dog, but it could also be a wolf or a coyote"). We use information theory to compare how a number of commonly used supervision signals contribute to representation-learning performance, and how their capacity is affected by factors such as the number of labels, classes, dimensions, and noise. Our framework provides theoretical justification for using hard labels in the big-data regime, and for using richer supervision signals in few-shot learning and out-of-distribution generalization. We validate these results empirically in a series of experiments with over 1 million crowdsourced image annotations, and we conduct a cost-benefit analysis to establish a tradeoff curve that enables users to optimize the cost of supervising representation learning on their own datasets.
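To make the hard-versus-soft distinction concrete, the following is a minimal sketch rather than the framework described above. The class list, the probabilities, and the helper names `entropy_bits` and `to_hard_label` are all illustrative assumptions. It collapses a soft label to a hard one and uses Shannon entropy as a rough indicator of how much of the annotator's spread over classes is discarded.

```python
import math

# Hypothetical class set, taken from the dog/wolf/coyote example.
CLASSES = ["dog", "wolf", "coyote"]

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    return sum(-q * math.log2(q) for q in p if q > 0)

def to_hard_label(soft):
    """Collapse a soft label to a one-hot vector on its most likely class."""
    top = max(range(len(soft)), key=lambda i: soft[i])
    return [1.0 if i == top else 0.0 for i in range(len(soft))]

# Assumed soft annotation: "most likely a dog, but could be a wolf or a coyote".
soft_label = [0.5, 0.25, 0.25]
hard_label = to_hard_label(soft_label)

print(dict(zip(CLASSES, hard_label)))  # only the closest class survives
print(entropy_bits(soft_label))        # spread over classes kept by the soft label
print(entropy_bits(hard_label))        # a one-hot label carries no such spread
```

The entropy here is only a proxy for what the hard label throws away about related classes; it should not be read as the capacity measure the paper itself defines.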