Neural FOXP2: Language-Specific Neuron Steering for Targeted Language Improvement in LLMs

LLMs are multilingual by training, yet their lingua franca is often English, reflecting the dominance of English in pretraining. Other languages remain in parametric memory but are systematically suppressed. We argue that language defaultness is governed by a sparse, low-rank control circuit of language neurons that can be mechanistically isolated and safely steered. We introduce Neural FOXP2, a method that makes a chosen language (Hindi or Spanish) primary in a model by steering language-specific neurons. Neural FOXP2 proceeds in three stages. (i) Localize: We train per-layer sparse autoencoders (SAEs) so that each activation decomposes into a small set of active feature components. For every feature, we quantify its English vs. Hindi/Spanish selectivity via its overall logit-mass lift toward the target-language token set. Tracing the top-ranked features back to their strongest contributing units yields a compact language-neuron set. (ii) Steering directions: We localize controllable language-shift geometry via a spectral low-rank analysis. For each layer, we build English-to-target activation-difference matrices and perform layerwise SVD to extract the dominant singular directions governing language change. The eigengap and effective-rank spectra identify a compact steering subspace and an empirically chosen intervention window, the layers where these directions are strongest and most stable. (iii) Steer: We apply a signed, sparse activation shift targeted at the language neurons. Concretely, within low-to-mid layers we add a positive steering shift along the dominant target-language directions and a compensating negative shift toward the null space for the English neurons, yielding controllable target-language defaultness.
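
As a rough illustration of stages (ii) and (iii), the sketch below runs an SVD on a single layer's English-to-target activation-difference matrix and then applies a signed, sparse shift. It is a minimal sketch on synthetic data, not the paper's implementation. The function names `steering_subspace` and `steer`, the strengths `alpha` and `beta`, the largest-eigengap cutoff, the entropy-based definition of effective rank, and the reading of the "negative shift toward the null space" as scaling English-neuron coordinates toward zero are all assumptions made for illustration. The SAE-based localization of stage (i) is omitted; neurons are simply taken to be the coordinates with the largest loadings on the top direction.

```python
import numpy as np


def steering_subspace(h_en, h_tgt):
    """Spectral analysis of one layer (illustrative, hypothetical helper).

    h_en, h_tgt: (n_prompts, d_model) activations for paired English and
    target-language prompts. Returns (directions, effective_rank).
    """
    diff = h_tgt - h_en  # English-to-target activation-difference matrix
    _, s, vt = np.linalg.svd(diff, full_matrices=False)

    # Entropy-based effective rank of the singular-value spectrum (assumed definition).
    p = s / s.sum()
    eff_rank = float(np.exp(-np.sum(p * np.log(p + 1e-12))))

    # Keep the directions that sit above the largest eigengap (assumed cutoff rule).
    k = int(np.argmax(s[:-1] - s[1:])) + 1
    dirs = vt[:k].copy()

    # SVD signs are arbitrary: orient each direction toward the mean shift.
    signs = np.sign(dirs @ diff.mean(axis=0))
    signs[signs == 0] = 1.0
    dirs *= signs[:, None]
    return dirs, eff_rank


def steer(h, dirs, target_idx, english_idx, alpha=2.0, beta=0.5):
    """Signed, sparse shift restricted to the chosen neuron coordinates."""
    out = h.copy()
    push = dirs.sum(axis=0)
    out[:, target_idx] += alpha * push[target_idx]  # positive shift on target-language neurons
    out[:, english_idx] *= 1.0 - beta               # damp English neurons toward zero
    return out


# Synthetic demo: the "language change" is one planted direction u plus small noise.
rng = np.random.default_rng(0)
n, d = 64, 16
u = np.zeros(d)
u[[1, 4, 7]] = 1.0 / np.sqrt(3.0)
h_en = rng.normal(size=(n, d))
h_tgt = h_en + rng.uniform(2.0, 3.0, size=(n, 1)) * u + 0.05 * rng.normal(size=(n, d))

dirs, eff_rank = steering_subspace(h_en, h_tgt)
target_idx = np.argsort(-np.abs(dirs[0]))[:3]  # stand-in for target-language neurons
english_idx = np.array([0, 2])                 # hypothetical English-neuron indices
steered = steer(h_en, dirs, target_idx, english_idx)
```

In this toy setting the recovered subspace is one-dimensional and closely aligned with the planted direction `u`. In a real model the same analysis would be repeated per layer, and the layers whose spectra are sharpest and most stable would form the intervention window.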
