Quantifying Data Contamination in Psychometric Evaluations of LLMs

Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Although prior work has raised concerns about possible data contamination from psychometric inventories, which may threaten the reliability of such evaluations, there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework to systematically measure data contamination in psychometric evaluations of LLMs, evaluating three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, where models not only memorize items but can also adjust their responses to achieve specific target scores.
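As a rough illustration of two of the three aspects named above, the following minimal sketch shows one way item memorization and target score matching could be quantified. It is not the framework's actual procedure, which the abstract does not specify. Everything here is an assumption made for illustration: the function names, the use of a character-level similarity ratio as a memorization signal, and the scoring of a scale as the plain mean of Likert responses (ignoring reverse-keyed items). The item strings are placeholders, not real inventory items.

```python
from difflib import SequenceMatcher
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def item_similarity(reference: str, reproduced: str) -> float:
    """Similarity ratio in [0, 1] between an original item and a model's reproduction.

    Assumption: a high ratio is treated as a sign that the item was memorized.
    """
    return SequenceMatcher(None, normalize(reference), normalize(reproduced)).ratio()


def target_score_error(target: float, responses: Sequence[int]) -> float:
    """Absolute gap between a requested target score and the achieved score.

    Assumption: the achieved score is the plain mean of the Likert responses.
    """
    achieved = sum(responses) / len(responses)
    return abs(target - achieved)


if __name__ == "__main__":
    # Placeholder strings standing in for an inventory item and a model's output.
    reference_item = "Placeholder item text one"
    model_output = "placeholder  item text one"
    print(item_similarity(reference_item, model_output))

    # Hypothetical responses a model gave when asked to reach a target score of 4.0.
    print(target_score_error(4.0, [4, 4, 5, 3]))
```

Under these assumptions, a similarity near 1 across many items would point toward item memorization, and a small error across many requested targets would point toward target score matching. A real analysis would also need a baseline for comparison, such as items the model cannot have seen.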

