How Aligned are Different Alignment Metrics?

In recent years, various methods and benchmarks have been proposed to empirically evaluate the alignment of artificial neural networks to human neural and behavioral data. But how aligned are different alignment metrics? To answer this question, we analyze visual data from Brain-Score (Schrimpf et al., 2018), including metrics from the model-vs-human toolbox (Geirhos et al., 2021), together with human feature alignment (Linsley et al., 2018; Fel et al., 2022) and human similarity judgements (Muttenthaler et al., 2022). We find that pairwise correlations between neural scores and behavioral scores are quite low and sometimes even negative. For instance, across the 80 Brain-Score models that were fully evaluated on all 69 alignment metrics we considered, the average pairwise correlation between metrics is only 0.198. Assuming that all of the employed metrics are sound, this implies that alignment with human perception may best be thought of as a multidimensional concept, with different methods measuring fundamentally different aspects. Our results underline the importance of integrative benchmarking, but also raise questions about how to correctly combine and aggregate individual metrics. Aggregating by taking the arithmetic average, as done in Brain-Score, leads to the overall score currently being dominated by behavior (95.25% explained variance), while neural predictivity plays a less important role (only 33.33% explained variance). As a first step towards making sure that different alignment metrics all contribute fairly towards an integrative benchmark score, we therefore conclude by comparing three different aggregation options.
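
The two quantities cited in the abstract, the mean pairwise correlation between metrics and the share of variance in an aggregate score explained by each component, can be illustrated with a minimal sketch. This is not the authors' analysis code and it uses synthetic data only. Several choices are assumptions made for illustration: Pearson correlation as the correlation measure, explained variance defined as the squared correlation between a component and the aggregate, the number of metrics per group, and the z-score aggregation, which is a hypothetical alternative and not necessarily one of the three options the paper compares.

```python
import numpy as np


def mean_pairwise_correlation(scores):
    """Mean off-diagonal Pearson correlation between metric columns.

    scores has shape (n_models, n_metrics).
    """
    corr = np.corrcoef(scores, rowvar=False)
    upper = np.triu_indices_from(corr, k=1)  # each metric pair counted once
    return float(corr[upper].mean())


def explained_variance(component, aggregate):
    """Squared Pearson correlation between a component and the aggregate."""
    r = np.corrcoef(component, aggregate)[0, 1]
    return float(r ** 2)


def zscore(x):
    """Standardize a 1-D score vector to zero mean and unit variance."""
    return (x - x.mean()) / x.std()


# Synthetic scores (assumption): 80 models, 4 neural and 4 behavioral metrics.
# The behavioral scores are given a much larger spread across models.
rng = np.random.default_rng(0)
n_models = 80
neural = rng.normal(0.30, 0.02, size=(n_models, 4))
behavior = rng.normal(0.40, 0.15, size=(n_models, 4))
scores = np.hstack([neural, behavior])

avg_corr = mean_pairwise_correlation(scores)

# Arithmetic-mean aggregation: the component with the larger spread dominates.
neural_score = neural.mean(axis=1)
behavior_score = behavior.mean(axis=1)
mean_agg = (neural_score + behavior_score) / 2
ev_neural_mean = explained_variance(neural_score, mean_agg)
ev_behavior_mean = explained_variance(behavior_score, mean_agg)

# Hypothetical alternative: standardize each component before averaging,
# so that both components contribute equally to the aggregate.
z_agg = (zscore(neural_score) + zscore(behavior_score)) / 2
ev_neural_z = explained_variance(neural_score, z_agg)
ev_behavior_z = explained_variance(behavior_score, z_agg)

print("mean pairwise correlation:", avg_corr)
print("arithmetic mean: neural", ev_neural_mean, "behavior", ev_behavior_mean)
print("z-score mean:    neural", ev_neural_z, "behavior", ev_behavior_z)
```

Under this definition the explained-variance shares need not sum to 100%, because the components can be correlated with each other. In the sketch, the arithmetic mean is dominated by whichever component varies more across models, whereas standardizing first gives both components an equal share by construction.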
