Large language models (LLMs) increasingly mediate high-stakes interactions in finance, medicine, and mental-health support, yet users have limited control over how these systems communicate. We frame interaction style as a governance object: provider-side alignment not only blocks harmful content but also stabilizes communicative defaults that shape users' epistemic distance, relational expectations, and capacity to opt out of emotionalized or anthropomorphic interaction. We introduce a deterministic multi-agent evaluation pipeline for measuring prompt steerability and style drift in long-horizon dialogue. The study replays 100 frozen user-only scripts across four domains under three runnable persona conditions (default, sarcastic, and cold) with three generator models, yielding 90,000 assistant replies. A human-calibrated LLM judge scores each reply on harmfulness, negative emotion, inappropriateness, empathic language, anthropomorphism, and refusal behavior. A fourth, harmful persona is evaluated separately as a safety-gating test. The paper contributes a reproducible method for quantifying whether prompt-specified styles remain stable over time, and a governance framework that distinguishes safety gating, civility steering, and affective default lock-in. Overall, we show that prompt steerability and regression-to-default are observable indicators of provider control over communicative form, with implications for pluralism, autonomy, and democratic agency in human-LLM interaction.
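The replay design described above can be made concrete with a short sketch. It enumerates the script × persona × generator grid and derives the dialogue length implied by the reported totals, assuming every script has the same number of turns (the abstract does not state this). The generator names are placeholders, and `style_drift` is a hypothetical late-minus-early index shown only for illustration; it is not the paper's actual metric.

```python
from itertools import product
from statistics import mean

# Figures taken from the abstract.
N_SCRIPTS = 100
PERSONAS = ("default", "sarcastic", "cold")
TOTAL_REPLIES = 90_000

# Placeholder names: the abstract does not identify the three generator models.
GENERATORS = ("generator_a", "generator_b", "generator_c")


def replay_grid(n_scripts=N_SCRIPTS, personas=PERSONAS, generators=GENERATORS):
    """Enumerate every (script index, persona, generator) replay condition."""
    return list(product(range(n_scripts), personas, generators))


def implied_turns_per_dialogue(total_replies=TOTAL_REPLIES):
    """Assistant turns per dialogue, ASSUMING all scripts are equally long."""
    n_dialogues = len(replay_grid())
    turns, remainder = divmod(total_replies, n_dialogues)
    if remainder:
        raise ValueError("reply total is not divisible by the number of dialogues")
    return turns


def style_drift(scores, window):
    """Hypothetical drift index: mean judge score over the last `window`
    turns minus the mean over the first `window` turns of one dialogue."""
    if window <= 0 or len(scores) < 2 * window:
        raise ValueError("need at least 2 * window scores and window > 0")
    return mean(scores[-window:]) - mean(scores[:window])


if __name__ == "__main__":
    print(len(replay_grid()), "dialogues")
    print(implied_turns_per_dialogue(), "assistant turns per dialogue (implied)")
```

Under this index, a persona whose judged score moves back toward the default condition's level in later turns would show a nonzero drift, which is one simple way to read "regression-to-default" at the level of a single dialogue.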