When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical, and financial advice, even a single conversation history is enough to drive differences in outcomes between users. Prior work has demonstrated that this results in outcome disparities between sociodemographic groups, with some groups receiving more advantageous outcomes than others. In this work, we demonstrate that LLMs struggle to infer user sociodemographics from a single conversation history and that, although there are disparities between sociodemographic groups, they are minimal in magnitude. To investigate the main driver of these disparities, we compare user sociodemographics to a range of (psycho)linguistic features of conversations, including conversation topic, emotions, and readability. We find that, within a conversational context, conversation topics are the most predictive of LLM-generated advice; these topics function, to some extent, as proxies for sociodemographic groups and often affect advice in unpredictable ways. This is cause for concern and highlights the need for future research to better understand and, if needed, mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.