Model diffing, the process of comparing models' internal representations to identify their differences, is a promising approach for uncovering safety-critical behaviors in new models. However, its application has so far focused primarily on comparing a base model with its finetune. Since new LLM releases often use novel architectures, cross-architecture methods are essential for making model diffing widely applicable. Crosscoders are one solution capable of cross-architecture model diffing, but they have only ever been applied to base-vs-finetune comparisons. We provide the first application of crosscoders to cross-architecture model diffing and introduce Dedicated Feature Crosscoders (DFCs), an architectural modification designed to better isolate features unique to one model. Using this technique, we discover, in an unsupervised fashion, features including Chinese Communist Party alignment in Qwen3-8B and Deepseek-R1-0528-Qwen3-8B, American exceptionalism in Llama3.1-8B-Instruct, and a copyright-refusal mechanism in GPT-OSS-20B. Together, our results work toward establishing cross-architecture crosscoder model diffing as an effective method for identifying meaningful behavioral differences between AI models.