Examining Exams Using Rasch Models and Assessment of Measurement Invariance

Many statisticians regularly teach large lecture courses on statistics, probability, or mathematics for students from other fields such as business and economics, social sciences, and psychology. The corresponding exams often use a multiple-choice or single-choice format and are typically evaluated and graded automatically, either by scanning printed exams or via online learning management systems. Although further analysis of these exams would be of interest, it is frequently not carried out. For example, a measurement scale for the difficulty of the questions (or items) and the ability of the students (or subjects) could be established using psychometric item response theory (IRT) models. Moreover, based on such a model it could be assessed whether the exam is really fair for all participants or whether certain items are easier (or more difficult) for certain subgroups of students. Here, several recent methods for assessing measurement invariance and for detecting differential item functioning in the Rasch IRT model are discussed and applied to results from a first-year mathematics exam with single-choice items. Several categorical, ordered, and numeric covariates such as gender, prior experience, and prior mathematics knowledge are available to form potential subgroups with differential item functioning. Specifically, all analyses are demonstrated in a hands-on R tutorial using the psycho* family of R packages (psychotools, psychotree, psychomix), which provide a unified approach to estimating, visualizing, testing, mixing, and partitioning a range of psychometric models. The paper is dedicated to the memory of Fritz Leisch (1968-2024), and his contributions to various aspects of this work are highlighted.
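
For readers unfamiliar with the Rasch model mentioned above, its standard textbook form can be sketched as follows. The notation is an assumption introduced here for illustration and is not taken from the abstract: $\theta_i$ denotes the ability of student $i$, $\beta_j$ the difficulty of item $j$, and $X_{ij} \in \{0, 1\}$ indicates whether student $i$ solved item $j$. The second line states, informally, what differential item functioning (DIF) means in this notation for two hypothetical subgroups $g$ and $h$.

```latex
% Rasch model: probability that student i solves item j
\[
  P(X_{ij} = 1 \mid \theta_i, \beta_j)
    = \frac{\exp(\theta_i - \beta_j)}{1 + \exp(\theta_i - \beta_j)}
\]

% Measurement invariance: each item difficulty is the same in every subgroup.
% DIF: for at least one item j, the difficulty differs between subgroups g and h.
\[
  \text{invariance: } \beta_j^{(g)} = \beta_j^{(h)} \text{ for all } j,
  \qquad
  \text{DIF: } \beta_j^{(g)} \neq \beta_j^{(h)} \text{ for some } j
\]
```

Under this form, a student whose ability equals an item's difficulty ($\theta_i = \beta_j$) solves that item with probability $1/2$, and the probability increases with $\theta_i - \beta_j$.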

