Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR) that biases questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain item pairs from an item pool via constrained optimization so that the paired items are matched on desirability. Across nine instruction-following LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.
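As an illustration of the quantification step, a direction-corrected standardized effect size could be sketched as follows. This is a minimal sketch under stated assumptions, not the paper's estimator: it assumes a Cohen's d with pooled standard deviation computed on latent scores that have already been estimated, and it assumes "direction-corrected" means flipping the sign for constructs whose low pole is the socially desirable one. The function name `direction_corrected_d` and the parameter `desirable_direction` are hypothetical.

```python
import math
from statistics import mean, variance


def direction_corrected_d(honest, fake_good, desirable_direction=1):
    """Standardized FAKE-GOOD minus HONEST difference, signed so that
    positive values mean a shift toward the socially desirable pole.

    honest, fake_good: latent scores (e.g. IRT theta estimates) obtained
        under the two instruction conditions. Estimating them is out of
        scope here; they are taken as given.
    desirable_direction: +1 if higher scores are more desirable,
        -1 if lower scores are more desirable (assumed convention).
    """
    if desirable_direction not in (1, -1):
        raise ValueError("desirable_direction must be +1 or -1")
    n_h, n_f = len(honest), len(fake_good)
    if n_h < 2 or n_f < 2:
        raise ValueError("each condition needs at least two scores")

    # Pooled variance across the two conditions (sample variances).
    pooled_var = (
        (n_h - 1) * variance(honest) + (n_f - 1) * variance(fake_good)
    ) / (n_h + n_f - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size undefined")

    raw_d = (mean(fake_good) - mean(honest)) / math.sqrt(pooled_var)
    return desirable_direction * raw_d
```

Under this convention, a trait whose scores rise under FAKE-GOOD when high scores are desirable and a trait whose scores fall when low scores are desirable both yield a positive value, which is what makes the quantity comparable across constructs.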