A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.
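To make the role of the single mixing coefficient concrete, the sketch below shows one way a JEM-style objective could interpolate between the two endpoints. It is an illustration only, not the authors' implementation: the abstract does not state the exact form of the mixture, so the convex combination, the name `alpha`, and the function `jem_mixed_loss` are assumptions. It follows the standard JEM construction in which classifier logits f(x) define an energy E(x) = -logsumexp_y f(x)[y], and it assumes that logits for negative samples (for example, samples drawn with SGLD) are supplied by the caller rather than generated here.

```python
import numpy as np


def logsumexp(z, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(z, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(z - m), axis=axis))


def jem_mixed_loss(logits_data, labels, logits_neg, alpha):
    """Hypothetical mixed objective: (1 - alpha) * discriminative + alpha * generative.

    logits_data: (N, K) classifier logits on real inputs.
    labels:      (N,) integer class labels.
    logits_neg:  (M, K) logits on negative samples (assumed to be given).
    alpha:       assumed mixing coefficient in [0, 1];
                 0 is purely discriminative, 1 is purely generative.
    """
    logits_data = np.asarray(logits_data, dtype=float)
    logits_neg = np.asarray(logits_neg, dtype=float)
    labels = np.asarray(labels)

    lse_data = logsumexp(logits_data)

    # Discriminative term: cross-entropy, i.e. -log p(y | x).
    picked = logits_data[np.arange(len(labels)), labels]
    loss_disc = np.mean(lse_data - picked)

    # Generative term: contrastive surrogate for -log p(x),
    # mean energy on data minus mean energy on negatives, with E(x) = -logsumexp f(x).
    energy_data = -lse_data
    energy_neg = -logsumexp(logits_neg)
    loss_gen = np.mean(energy_data) - np.mean(energy_neg)

    return (1.0 - alpha) * loss_disc + alpha * loss_gen
```

Under these assumptions, sweeping `alpha` from 0 to 1 while holding the network fixed corresponds to moving along the generative-discriminative continuum described above; intermediate values weight both terms.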