Error-correcting codes (ECC) are used to reduce multiclass classification tasks to multiple binary classification subproblems. In ECC, classes are represented by the rows of a binary matrix, which correspond to codewords in a codebook. Codebooks are commonly either predefined or problem-dependent. With predefined codebooks, the codeword-to-class assignment is traditionally overlooked, and codewords are implicitly assigned to classes arbitrarily. We show that this assignment plays a major role in the performance of ECC. Specifically, we examine similarity-preserving assignments, in which similar codewords are assigned to similar classes. Addressing a controversy in the existing literature, our extensive experiments confirm that similarity-preserving assignments induce easier subproblems and generalize better than other assignment policies. We find that similarity-preserving assignments make predefined codebooks problem-dependent without altering other favorable codebook properties. Finally, we show that our findings can improve predefined codebooks designed for extreme classification.
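To make the idea of a similarity-preserving assignment concrete, the following is a minimal sketch and not the procedure used in the paper, which the abstract does not specify. It assumes that class similarity is given as a pairwise distance matrix (here derived from made-up one-dimensional class centroids), that codeword similarity is Hamming distance, and that the codebook is small enough to search all assignments by brute force. The function names `hamming_distances`, `assignment_cost`, `similarity_preserving_assignment`, and `decode`, as well as the cost function itself, are hypothetical choices for illustration.

```python
from itertools import permutations

import numpy as np


def hamming_distances(codebook):
    """Pairwise Hamming distances between the rows (codewords) of a binary matrix."""
    return (codebook[:, None, :] != codebook[None, :, :]).sum(axis=2)


def assignment_cost(class_dist, code_dist, perm):
    """Total mismatch between class distances and the distances of the assigned codewords.

    perm[c] is the index of the codebook row assigned to class c.
    """
    assigned = code_dist[np.ix_(perm, perm)]
    return float(np.abs(class_dist - assigned).sum())


def similarity_preserving_assignment(class_dist, codebook):
    """Brute-force search for the assignment that best matches the two distance structures.

    Assumption: both distance matrices are rescaled to [0, 1] so they are comparable.
    Only feasible for a handful of classes, since it tries every permutation.
    """
    code_dist = hamming_distances(codebook).astype(float)
    code_dist = code_dist / code_dist.max()
    class_dist = class_dist / class_dist.max()
    n_classes = len(codebook)
    best = min(
        permutations(range(n_classes)),
        key=lambda p: assignment_cost(class_dist, code_dist, list(p)),
    )
    return list(best)


def decode(bits, codebook, perm):
    """Return the class whose assigned codeword is nearest to `bits` in Hamming distance."""
    class_codewords = codebook[np.asarray(perm)]
    return int(np.argmin((class_codewords != bits).sum(axis=1)))


# Toy predefined codebook: 4 codewords of 3 bits each (one bit per binary subproblem).
codebook = np.array([
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
])

# Made-up class centroids: classes 0 and 2 are similar, as are classes 1 and 3.
centroids = np.array([0.0, 5.0, 1.0, 6.0])
class_dist = np.abs(centroids[:, None] - centroids[None, :])

perm = similarity_preserving_assignment(class_dist, codebook)
for c, row in enumerate(perm):
    print(f"class {c} -> codeword {codebook[row]}")
```

In this toy setup, the similar class pairs end up with codewords that differ in a single bit, while the codebook itself (its rows, and hence its row separation) is unchanged; only the row-to-class mapping differs. The 3-bit codebook is chosen for readability and has no error-correction margin.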