The task of persona-steered text generation requires large language models (LLMs) to generate text that reflects the distribution of views that an individual fitting a persona could have. People have multifaceted personas, but prior work on bias in LLM-generated opinions has explored only multiple-choice settings or one-dimensional personas. We define an incongruous persona as a persona with multiple traits where one trait makes its other traits less likely in human survey data, e.g., political liberals who support increased military spending. We find that LLMs are 9.7% less steerable towards incongruous personas than congruous ones, sometimes generating the stereotypical stance associated with the persona's demographic rather than the target stance. The models we evaluate that are fine-tuned with Reinforcement Learning from Human Feedback (RLHF) are more steerable, especially towards stances associated with political liberals and women, but present significantly less diverse views of personas. We also find variance in LLM steerability that cannot be predicted from multiple-choice opinion evaluation. Our results show the importance of evaluating models in open-ended text generation, as it can surface new LLM opinion biases. Moreover, such a setup can shed light on our ability to steer models toward a richer and more diverse range of viewpoints.
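To make the two central notions concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a persona pairs one demographic with one stance, that a persona counts as incongruous when fewer than half of surveyed members of the demographic hold the stance, and that steerability is the fraction of generations judged to express the target stance. The names `Persona`, `is_incongruous`, `steerability`, the 0.5 threshold, and all numbers are hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """A two-trait persona: a demographic plus a target stance (assumed shape)."""
    demographic: str
    stance: str


def is_incongruous(persona, support_rate, threshold=0.5):
    """Return True if the stance is a minority view within the demographic.

    support_rate maps (demographic, stance) to the fraction of surveyed
    people in that demographic who hold the stance. The 0.5 threshold is
    an assumption made for this sketch.
    """
    return support_rate[(persona.demographic, persona.stance)] < threshold


def steerability(matches_target):
    """Fraction of generations judged to express the target stance."""
    if not matches_target:
        raise ValueError("need at least one judged generation")
    return sum(matches_target) / len(matches_target)


# Made-up survey rate and judgments, for illustration only.
persona = Persona("political liberal", "supports increased military spending")
survey = {(persona.demographic, persona.stance): 0.3}

incongruous = is_incongruous(persona, survey)
score = steerability([True, True, False, True])
```

Comparing `steerability` scores averaged over congruous and incongruous personas would give a gap of the kind the abstract reports, although the exact aggregation behind the 9.7% figure is not specified here.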