Learned representations across models and modalities often exhibit striking structural similarities, suggesting shared underlying concept decompositions. However, concept alignment remains poorly defined: existing approaches optimize different objectives under the same terminology, obscuring what is actually aligned. We propose a unifying framework that decomposes alignment along two axes: what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional). This induces four corresponding properties -- instance-wise and distributional variants of translation and concept consistency -- and reveals precisely which of these guarantees existing methods provide. We further introduce \InterVenchA, an intervention-based benchmark that separately measures extraction quality, translation quality, and concept consistency. Through theory and experiments, we show that commonly assumed equivalences between alignment objectives fail in practice: optimizing one property does not reliably recover the others, and purely unsupervised objectives fail to recover meaningful instance-level alignment. We then propose the Coupled Sparse Autoencoder (CoSAE), which jointly enforces complementary alignment objectives. Strong alignment emerges only in this regime. Surprisingly, as little as 0.1\% paired data is sufficient to recover instance-level alignment when anchoring distributional objectives. Overall, our results show that concept alignment is fundamentally multi-objective: it must be defined, measured, and optimized as such.