This study presents the development of the PsyCogMetrics AI Lab (psycogmetrics.ai), an integrated, cloud-based platform that operationalizes psychometric and cognitive-science methodologies for Large Language Model (LLM) evaluation. Framed as a three-cycle Action Design Science study, the research proceeds through a Relevance Cycle, which identifies key limitations of current evaluation methods and unmet stakeholder needs; a Rigor Cycle, which draws on kernel theories such as Popperian falsifiability, Classical Test Theory, and Cognitive Load Theory to derive deductive design objectives; and a Design Cycle, which operationalizes these objectives through nested Build-Intervene-Evaluate loops. The study contributes a novel IT artifact, a validated design for LLM evaluation, benefiting research at the intersection of AI, psychology, cognitive science, and the social and behavioral sciences.
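To make the abstract's central claim concrete, the sketch below shows one way Classical Test Theory might be operationalized for LLM evaluation: repeated sampling runs of a model are treated as "respondents" and benchmark prompts as test "items", and internal-consistency reliability (Cronbach's alpha) is computed over the resulting score matrix. This is a minimal illustration under those assumptions; the function name, data, and workflow are hypothetical and do not reflect the platform's actual API.

```python
# Illustrative sketch only: Classical Test Theory reliability applied to
# LLM scoring. Rows = independent sampling runs of one model ("respondents"),
# columns = scored benchmark prompts ("items"). All names and data here are
# assumptions for demonstration, not the PsyCogMetrics AI Lab implementation.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (runs x items) score matrix.

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))
    """
    k = scores.shape[1]                         # number of items (prompts)
    item_vars = scores.var(axis=0, ddof=1)      # per-item variance across runs
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of run total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy example: 5 runs on 4 prompts, scores in [0, 1]. Higher alpha suggests
# the prompts measure the target construct more consistently across runs.
scores = np.array([
    [0.9, 0.8, 0.7, 0.9],
    [0.8, 0.7, 0.6, 0.8],
    [1.0, 0.9, 0.8, 1.0],
    [0.7, 0.6, 0.5, 0.7],
    [0.9, 0.9, 0.7, 0.9],
])
print(f"Cronbach's alpha: {cronbach_alpha(scores):.3f}")
```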