We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space. This bypasses the SAE's top-k discretization, enabling fine-grained behavioral intervention at inference time without retraining. Across 1,800 agent rollouts (50 scenarios × 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen's d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all five steering vectors primarily modulate a single dominant agency axis (the disposition to act independently versus defer to the user), with trait-specific effects appearing only as secondary modulations in tool-type composition and dose-response shape. The tool-use vector steers behavior (d = 0.39); the risk-calibration vector produces only suppression. We additionally show that steering only during autoregressive decoding has no detectable effect (p > 0.35), providing causal evidence that behavioral commitments are computed during prefill in GatedDeltaNet architectures.
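The probe-to-steering-vector projection described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The decoder layout `(n_latents, d_model)`, the unit-norm normalization, the additive update at every token position, and the names `steering_vector` and `apply_steering` are all assumptions made for illustration, and random arrays stand in for a trained SAE decoder and probe.

```python
import numpy as np


def steering_vector(probe_w, W_dec, normalize=True):
    """Project probe weights from SAE latent space into activation space.

    probe_w: (n_latents,) linear-probe weights over SAE latents.
    W_dec:   (n_latents, d_model) SAE decoder matrix (assumed layout).
    Returns a (d_model,) vector: the probe-weighted sum of decoder rows.
    """
    probe_w = np.asarray(probe_w, dtype=np.float64)
    W_dec = np.asarray(W_dec, dtype=np.float64)
    if W_dec.ndim != 2 or probe_w.shape != (W_dec.shape[0],):
        raise ValueError("probe_w must have one weight per SAE latent")
    v = probe_w @ W_dec  # dense projection, no top-k selection involved
    if normalize:  # unit-norm scaling is an assumption of this sketch
        norm = np.linalg.norm(v)
        if norm == 0.0:
            raise ValueError("steering vector has zero norm")
        v = v / norm
    return v


def apply_steering(resid, v, multiplier):
    """Add multiplier * v to every position of a (seq_len, d_model) array."""
    return np.asarray(resid, dtype=np.float64) + multiplier * v


# Toy stand-ins for a trained decoder, probe, and residual-stream activations.
rng = np.random.default_rng(0)
n_latents, d_model, seq_len = 64, 16, 5
W_dec = rng.normal(size=(n_latents, d_model))
probe_w = rng.normal(size=n_latents)
resid = rng.normal(size=(seq_len, d_model))

v = steering_vector(probe_w, W_dec)
steered = apply_steering(resid, v, multiplier=2.0)
```

Because the projection is a dense matrix product over all latents, the resulting direction is continuous in the probe weights, which is the sense in which it sidesteps top-k discretization. Where in the model the vector is added, and how the multiplier is scaled, would follow the paper's own setup rather than this sketch.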