In classification problems with large output spaces (up to millions of labels), the last layer can require an enormous amount of memory. Sparse connectivity would drastically reduce these memory requirements but, as we show below, it can substantially degrade the model's predictive performance. Fortunately, we found that this degradation can be mitigated by introducing a penultimate layer of intermediate size. We further demonstrate that the connectivity of the sparse layer can be constrained to be uniform, in the sense that every output neuron has exactly the same number of incoming connections. This constraint allows for efficient implementations of sparse matrix multiplication and connection redistribution on GPU hardware. Using a custom CUDA implementation, we show that the proposed approach scales to datasets with 670,000 labels on a single commodity GPU with only 4 GB of memory.
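To make the uniformity constraint concrete, the following is a minimal sketch in plain Python, not the custom CUDA implementation referred to above. It assumes one possible storage scheme: since every output neuron has the same number of incoming connections, the layer can be held in two rectangular arrays of shape (outputs, fan-in), one with input indices and one with weights. The function names, the random initialization, and the storage layout are all hypothetical and chosen only for illustration.

```python
import random


def make_uniform_sparse_layer(num_inputs, num_outputs, fan_in, seed=0):
    """Build a sparse layer in which every output has exactly `fan_in` inputs.

    Hypothetical layout: two rectangular tables of shape
    (num_outputs, fan_in), one for input indices and one for weights.
    """
    rng = random.Random(seed)
    # indices[j] holds the distinct input units connected to output j.
    indices = [rng.sample(range(num_inputs), fan_in) for _ in range(num_outputs)]
    # weights[j][c] is the weight on the c-th connection of output j.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(fan_in)]
               for _ in range(num_outputs)]
    return indices, weights


def forward(indices, weights, x):
    """Compute y[j] = sum_c weights[j][c] * x[indices[j][c]] for one input x."""
    return [
        sum(w * x[i] for i, w in zip(idx_row, w_row))
        for idx_row, w_row in zip(indices, weights)
    ]


def parameter_counts(num_inputs, num_outputs, fan_in):
    """Return (dense, sparse) numbers of stored values for the last layer.

    The sparse count stores one index and one weight per connection.
    """
    dense = num_inputs * num_outputs
    sparse = 2 * num_outputs * fan_in
    return dense, sparse


if __name__ == "__main__":
    idx, w = make_uniform_sparse_layer(num_inputs=16, num_outputs=5, fan_in=4)
    print(forward(idx, w, [1.0] * 16))
    # Illustrative sizes only: 670,000 labels as in the text; the input
    # width (768) and fan-in (32) are assumed values, not from the paper.
    print(parameter_counts(num_inputs=768, num_outputs=670_000, fan_in=32))
```

Because both tables are rectangular, each output can be processed with the same fixed amount of work and no per-row offset array is needed, which is one plausible reason a uniform layout maps well onto GPU hardware.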