Large language models (LLMs) are increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evaluations for subjective cultural queries largely focus on factual correctness and ignore how the response is framed. To address this gap, we introduce FRANZ, an automated FRAmework for respoNse characteriZation that conducts a communicative audit of LLM responses along four dimensions: cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. To enable this evaluation, we contribute SQUARE, a corpus of 376k subjective questions sourced from 57 subreddits and mapped to 7 countries and 19 question categories. We demonstrate FRANZ's applicability by scoring responses from three open-weight LLMs. We observe that the LLMs differ to a statistically significant degree in how frequently they employ each response characteristic. Unlike single-dimensional audits, FRANZ reveals that insider positioning and anthropomorphism are positively coupled, with the degree of coupling varying by country, providing a diagnostic lens for identifying framing divergences.