Controlling the behavior of large language models (LLMs) at inference time is essential for aligning outputs with human preferences and safety requirements. \emph{Activation steering} provides a lightweight alternative to prompt engineering and fine-tuning by directly modifying internal activations to guide generation. This work advances the literature in three directions. First, while previous work demonstrated the technical feasibility of steering emotional tone using automated classifiers, this paper presents the first human evaluation of activation steering of the emotional tone of LLM outputs, collecting over 7,000 crowd-sourced ratings from 190 participants recruited via Prolific. These ratings assess both perceived emotional intensity and overall text quality. Second, we find strong alignment between human and model-based quality ratings (mean $r=0.776$, range $0.157$--$0.985$), indicating that automated scoring can serve as a proxy for perceived quality. Moderate steering strengths ($\lambda \approx 0.15$) reliably amplify target emotions while preserving comprehensibility, with the strongest effects for disgust ($\eta_p^2 = 0.616$) and fear ($\eta_p^2 = 0.540$) and minimal effects for surprise ($\eta_p^2 = 0.042$). Finally, upgrading from Alpaca to Llama-3 yielded more consistent steering, with significant effects across emotions and strengths (all $p < 0.001$). Inter-rater reliability was high (ICC $= 0.71$--$0.87$), underscoring the robustness of these results. Together, the findings support activation-based control as a scalable method for steering LLM behavior across affective dimensions.
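As a rough illustration of the mechanism summarized above (a sketch, not the paper's exact implementation), the snippet below adds a steering vector, scaled by a strength $\lambda$, to the hidden states of one decoder layer via a forward hook during generation. The checkpoint name, injection layer index, and the random placeholder vector are illustrative assumptions; in practice the steering vector would be derived from activations (e.g., a contrast between emotional and neutral prompts).

\begin{verbatim}
# Minimal activation-steering sketch (illustrative assumptions throughout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"   # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.float16)

layer_idx = 15    # assumed injection layer
lam = 0.15        # steering strength lambda (the moderate value above)
# Placeholder steering vector; a real one would come from activation data.
steer_vec = torch.randn(model.config.hidden_size, dtype=torch.float16)

def add_steering(module, inputs, output):
    # Decoder layers return a tuple; shift the hidden states by lam * v
    # and leave the remaining outputs (attentions, cache) untouched.
    hidden = output[0] + lam * steer_vec.to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
ids = tok("Describe your day.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=60)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()   # restore unsteered behavior
\end{verbatim}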