Sigmoid output layers are widely used in multi-label classification (MLC) tasks, in which multiple labels can be assigned to any input. In many practical MLC tasks, the number of possible labels is in the thousands, often exceeding the number of input features and resulting in a low-rank output layer. In multi-class classification, it is known that such a low-rank output layer is a bottleneck that can result in unargmaxable classes: classes which cannot be predicted for any input. In this paper, we show that for MLC tasks, the analogous sigmoid bottleneck results in exponentially many unargmaxable label combinations. We explain how to detect these unargmaxable outputs and demonstrate their presence in three widely used MLC datasets. We then show that they can be prevented in practice by introducing a Discrete Fourier Transform (DFT) output layer, which guarantees that all sparse label combinations with up to $k$ active labels are argmaxable. Our DFT layer trains faster and is more parameter efficient, matching the F1@k score of a sigmoid layer while using up to 50% fewer trainable parameters. Our code is publicly available at https://github.com/andreasgrv/sigmoid-bottleneck.
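The bottleneck can be illustrated with a toy case. The following is a minimal sketch, not the paper's detection method: it assumes a bias-free output layer whose weight rows are in general position, and uses the classical region count for central hyperplane arrangements as an upper bound on the number of argmaxable label combinations. The names `max_argmaxable` and `reachable_patterns`, and the example weight matrix `W`, are hypothetical and chosen only for illustration.

```python
from math import comb, cos, sin, pi


def max_argmaxable(n_labels, dim):
    """Upper bound on label combinations reachable by thresholding
    sigmoid(W x) at 0.5, for W of shape (n_labels, dim) with no bias.

    Assumption: this is the region count of n_labels hyperplanes through
    the origin in general position in R^dim.
    """
    return 2 * sum(comb(n_labels - 1, i) for i in range(dim))


def reachable_patterns(W, steps=3600):
    """Brute-force the label combinations reachable for a 2-D feature space.

    Only the direction of x matters when there is no bias, so sweeping the
    unit circle is enough. sigmoid(z) > 0.5 exactly when z > 0.
    """
    patterns = set()
    for t in range(steps):
        # Offset by half a step so x never lies exactly on a decision boundary.
        theta = 2 * pi * (t + 0.5) / steps
        x = (cos(theta), sin(theta))
        logits = [w[0] * x[0] + w[1] * x[1] for w in W]
        patterns.add(tuple(int(z > 0) for z in logits))
    return patterns


# Hypothetical toy layer: 3 labels, 2 input features (rank-2 output layer).
W = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patterns = reachable_patterns(W)

print(len(patterns), "of", 2 ** len(W), "label combinations are reachable")
print("bound:", max_argmaxable(n_labels=3, dim=2))
```

In this toy layer the third logit is the sum of the first two, so the combination (1, 1, 0) can never be predicted: both of the first two logits being positive forces the third to be positive as well. When `dim` is fixed and `n_labels` grows, the bound grows only polynomially while the number of label combinations grows as 2^n, which is the sense in which most combinations become unargmaxable.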