Large language models (LLMs) are increasingly used to answer subjective, information-seeking questions, where users are sensitive to how a response is communicated, not just whether it is correct. Existing LLM evaluations for subjective cultural queries largely focus on factual correctness and ignore how the response is framed. To address this gap, we introduce FRANZ, an automated FRAmework for respoNse characteriZation that conducts a communicative audit of LLM responses along four dimensions: cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. To enable this evaluation, we contribute SQUARE, a corpus of 376k subjective questions sourced from 57 subreddits and mapped to 7 countries and 19 question categories. We demonstrate FRANZ's applicability by scoring responses from three open-weight LLMs. We observe that the LLMs show statistically significant differences in how often they employ each response characteristic. Unlike single-dimensional audits, FRANZ reveals that insider positioning and anthropomorphism are positively coupled, with the degree of coupling varying by country, providing a diagnostic lens for identifying framing divergences.