Large Language Models (LLMs) are increasingly deployed in socially sensitive domains, yet their unpredictable behaviors, ranging from misaligned intent to inconsistent personality, pose significant risks. We introduce SteerEval, a hierarchical benchmark for evaluating LLM controllability across three domains: language features, sentiment, and personality. Each domain is structured into three specification levels: L1 (what to express), L2 (how to express), and L3 (how to instantiate), connecting high-level behavioral intent to concrete textual output. Using SteerEval, we systematically evaluate contemporary steering methods, revealing that control often degrades at finer-grained levels. Our benchmark offers a principled and interpretable framework for safe and controllable LLM behavior, serving as a foundation for future research.