Automatic Item Generation for Personality Situational Judgment Tests with Large Language Models

Situational judgment tests (SJTs) offer unique advantages over traditional Likert-type self-report scales for personality assessment, yet developing them remains labor-intensive, time-consuming, and heavily dependent on subject matter experts. Recent advances in large language models (LLMs) have shown promise for automatic item generation (AIG). Building on these developments, the present study develops and evaluates a structured and generalizable framework for automatically generating personality SJTs, using GPT-4 and ChatGPT-5 as empirical examples. Three studies were conducted. Study 1 systematically compared the effects of prompt design and temperature settings on the content validity of LLM-generated items in order to develop an effective and stable LLM-based AIG approach for personality SJTs. Results showed that optimized prompts combined with a temperature of 1.0 achieved the best balance of creativity and accuracy on GPT-4. Study 2 examined the cross-model generalizability and reproducibility of this automated SJT generation approach across multiple rounds of generation. Results showed that the approach consistently produced reproducible, high-quality items on ChatGPT-5. Study 3 evaluated the psychometric properties of LLM-generated SJTs covering five facets of the Big Five personality traits. Results demonstrated satisfactory reliability and validity across most facets, though limitations were observed in the convergent validity of the compliance facet and in certain aspects of criterion-related validity. These findings provide robust evidence that the proposed LLM-based AIG approach can produce culturally appropriate and psychometrically sound SJTs with efficiency comparable to or exceeding that of traditional methods.
