Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding. In evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR) that biases questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, we administer the same inventory under HONEST and FAKE-GOOD instructions and compute SDR as a direction-corrected standardized effect size from latent scores estimated with item response theory (IRT). This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain item pairs from an item pool via constrained optimization, so that the two items in each pair are matched on desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas the desirability-matched GFC inventory substantially attenuates SDR while largely preserving recovery of the intended persona profiles. These results highlight a model-dependent trade-off between SDR and profile recovery, and they motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.
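As an illustration of the quantification step, the following is a minimal sketch of a direction-corrected standardized effect size. The abstract does not give the exact formula, so the sketch assumes a Cohen's d with a pooled standard deviation, computed on latent scores that have already been estimated by an IRT model. The function name `direction_corrected_d` and the `desirable_direction` parameter are hypothetical; the sign is +1 when higher scores are the socially desirable pole and -1 when lower scores are (for example, Neuroticism).

```python
import math
from statistics import mean, variance


def direction_corrected_d(honest, fake_good, desirable_direction=1):
    """Standardized mean difference between FAKE-GOOD and HONEST latent scores.

    Assumption: Cohen's d with a pooled SD; the paper's estimator may differ.
    desirable_direction is +1 if higher scores are socially desirable, -1 if
    lower scores are, so a positive result always means a shift toward the
    desirable pole.
    """
    n_h, n_f = len(honest), len(fake_good)
    # Pooled variance from the two sample variances (n - 1 denominators).
    pooled_var = (
        (n_h - 1) * variance(honest) + (n_f - 1) * variance(fake_good)
    ) / (n_h + n_f - 2)
    d = (mean(fake_good) - mean(honest)) / math.sqrt(pooled_var)
    return desirable_direction * d


# Toy latent scores (illustrative values only, not data from the study).
conscientiousness = direction_corrected_d(
    honest=[0.0, 1.0, 2.0], fake_good=[1.0, 2.0, 3.0], desirable_direction=1
)
neuroticism = direction_corrected_d(
    honest=[0.5, 1.0, 1.5], fake_good=[-1.5, -1.0, -0.5], desirable_direction=-1
)
```

Because the sign correction maps every trait onto a common "toward desirable" axis, values computed this way can be compared or averaged across constructs and response formats, which is the property the abstract relies on.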
