For socially sensitive tasks like hate speech detection, the quality of explanations from Large Language Models (LLMs) is crucial for factors like user trust and model alignment. While persona prompting (PP) is increasingly used to steer models toward user-specific generation, its effect on model rationales remains underexplored. We investigate how LLM-generated rationales vary when conditioned on different simulated demographic personas. Using datasets annotated with word-level rationales, we measure agreement with human annotations from different demographic groups and assess the impact of PP on model bias and human alignment. Our evaluation across three LLMs reveals three key findings: (1) PP improves classification on the most subjective task (hate speech) but degrades rationale quality. (2) Simulated personas fail to align with their real-world demographic counterparts, and high inter-persona agreement shows that models resist significant steering. (3) Models exhibit consistent demographic biases and a strong tendency to over-flag content as harmful, regardless of PP. Our findings reveal a critical trade-off: while PP can improve classification in socially sensitive tasks, it often comes at the cost of rationale quality and fails to mitigate underlying biases, urging caution in its application.