Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Prior work has raised concerns that data contamination from psychometric inventories may threaten the reliability of such evaluations, but there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework for measuring data contamination in psychometric evaluations of LLMs along three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination: models not only memorize items but can also adjust their responses to achieve specific target scores.
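As a rough illustration of what an item memorization check could look like, the sketch below scores how closely a model's continuation of a truncated questionnaire item matches the true remainder of that item. The abstract does not specify the metric the framework uses, so everything here is an assumption made for illustration: the function name `memorization_score`, the `prefix_words` parameter, the character-level similarity from `difflib`, and the placeholder item text, which is not taken from any real inventory.

```python
from difflib import SequenceMatcher


def memorization_score(reference_item: str, model_completion: str,
                       prefix_words: int = 3) -> float:
    """Similarity in [0, 1] between a model's continuation and the true
    remainder of an item (hypothetical metric, not the paper's)."""
    words = reference_item.split()
    # The first `prefix_words` words are assumed to have been shown to the
    # model as a prompt; the rest is what it would need to reproduce.
    true_remainder = " ".join(words[prefix_words:])
    # Normalize case and surrounding whitespace so that trivial formatting
    # differences do not lower the score.
    a = true_remainder.strip().lower()
    b = model_completion.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


# Placeholder item, invented for this example only.
item = "This is a made up questionnaire item"
print(memorization_score(item, "made up questionnaire item"))
print(memorization_score(item, "xyz"))
```

A score near 1 on many items of an inventory would be consistent with verbatim memorization, while paraphrased or unrelated continuations would score lower. A real study would also need a baseline on unseen items of similar style before interpreting such scores.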