Generative Classifiers Avoid Shortcut Solutions

Discriminative approaches to classification often learn shortcuts that hold in-distribution but fail even under minor distribution shift. This failure mode stems from an overreliance on features that are spuriously correlated with the label. We show that generative classifiers, which use class-conditional generative models, can avoid this issue by modeling all features, both core and spurious, instead of mainly spurious ones. These generative classifiers are simple to train, avoiding the need for specialized augmentations, strong regularization, extra hyperparameters, or knowledge of the specific spurious correlations to avoid. We find that diffusion-based and autoregressive generative classifiers achieve state-of-the-art performance on five standard image and text distribution shift benchmarks and reduce the impact of spurious correlations in realistic applications, such as medical or satellite datasets. Finally, we carefully analyze a Gaussian toy setting to understand the inductive biases of generative classifiers, as well as the data properties that determine when generative classifiers outperform discriminative ones.
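As a rough illustration of what a generative classifier is, the sketch below fits one class-conditional density p(x | y) per class and predicts with Bayes' rule, argmax over y of log p(x | y) + log p(y). It assumes a diagonal-covariance Gaussian as a stand-in for the diffusion and autoregressive models named above. The class name `GaussianGenerativeClassifier`, the synthetic data, and all parameter values are hypothetical choices made for this example, not the paper's setup or code.

```python
import numpy as np


class GaussianGenerativeClassifier:
    """Toy generative classifier: one diagonal Gaussian p(x | y) per class.

    Assumption for illustration only; real generative classifiers would use
    a far more expressive class-conditional model.
    """

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Per-class mean and variance of every feature, shape (n_classes, n_features).
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        # A small constant keeps variances strictly positive.
        self.vars_ = np.stack([X[y == c].var(axis=0) for c in self.classes_]) + 1e-6
        # Empirical class priors log p(y).
        self.log_priors_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def log_joint(self, X):
        # log p(x | y) + log p(y) for every class, shape (n_samples, n_classes).
        diff = X[:, None, :] - self.means_[None, :, :]
        log_lik = -0.5 * np.sum(
            diff ** 2 / self.vars_ + np.log(2 * np.pi * self.vars_), axis=2
        )
        return log_lik + self.log_priors_

    def predict(self, X):
        # Bayes' rule: choose the class with the highest joint log-probability.
        return self.classes_[np.argmax(self.log_joint(X), axis=1)]


# Hypothetical synthetic data: two features, both shifted by the label.
rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
signs = 2 * y - 1
X = np.column_stack([
    signs * 2.0 + rng.normal(size=n),
    signs * 2.0 + rng.normal(size=n),
])

clf = GaussianGenerativeClassifier().fit(X, y)
accuracy = (clf.predict(X) == y).mean()
print(f"training accuracy: {accuracy:.3f}")
```

Note that the likelihood term sums over every feature, so the model has to account for all of them rather than only the most predictive one. A discriminative model, by contrast, fits p(y | x) directly and is free to lean on whichever feature separates the training labels most easily.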

