Current aligned language models exhibit a dual failure mode we term the Evasive Servant: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers. We propose the Dignified Peer framework, which counters servility with anti-sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. Realizing such an agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias. We address these challenges by introducing the PersonaKnob dataset, which features a compositional partial-order structure over multiple persona preferences. This dataset is used alongside a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. Additionally, we employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle a model's latent persona capability from confounders such as judge bias. Extensive empirical studies demonstrate that our approach successfully builds an LLM agent that acts as a dignified peer.