Large language models (LLMs) are increasingly used to answer subjective, information-seeking questions, where users are sensitive to how a response is communicated, not just whether it is correct. Existing LLM evaluations for subjective cultural queries largely focus on factual correctness and ignore how the response is framed. To address this gap, we introduce FRANZ, an automated FRAmework for respoNse characteriZation that conducts a communicative audit of LLM responses along four dimensions: cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. To enable this evaluation, we contribute SQUARE, a corpus of 376k subjective questions sourced from 57 subreddits and mapped to 7 countries and 19 question categories. We demonstrate FRANZ's applicability by scoring responses from three open-weight LLMs. We observe that the LLMs differ statistically significantly in how frequently they employ each response characteristic. Unlike single-dimensional audits, FRANZ reveals that insider positioning and anthropomorphism are positively coupled, with the degree of coupling varying by country, providing a diagnostic lens for identifying framing divergences.