Many statisticians regularly teach large lecture courses on statistics, probability, or mathematics for students from other fields such as business and economics or social sciences and psychology. The corresponding exams often use a multiple-choice or single-choice format and are typically evaluated and graded automatically, either by scanning printed exams or via online learning management systems. Although further analyses of these exams would be of interest, they are frequently not carried out. For example, a measurement scale for the difficulty of the questions (or items) and the ability of the students (or subjects) could be established using psychometric item response theory (IRT) models. Moreover, based on such a model, it could be assessed whether the exam is really fair for all participants or whether certain items are easier (or more difficult) for certain subgroups of students. Here, several recent methods for assessing measurement invariance and for detecting differential item functioning in the Rasch IRT model are discussed and applied to results from a first-year mathematics exam with single-choice items. Several categorical, ordered, and numeric covariates such as gender, prior experience, and prior mathematics knowledge are available to form potential subgroups with differential item functioning. Specifically, all analyses are demonstrated in a hands-on R tutorial using the psycho* family of R packages (psychotools, psychotree, psychomix), which provide a unified approach to estimating, visualizing, testing, mixing, and partitioning a range of psychometric models. The paper is dedicated to the memory of Fritz Leisch (1968-2024), and his contributions to various aspects of this work are highlighted.