We explore whether human ratings of open-ended responses can be explained by features unrelated to content, and whether such effects vary across different mathematics-related items. When scoring is rigorously defined and rooted in a measurement framework, educators intend for scores to be driven by the features of a response that indicate the respondent's level of ability. However, we find that response length, a grammar score of the response, and a metric relating to key phrase frequency are significant predictors of response ratings. Although our findings are not causally conclusive, they may prompt us to be more critical of the way in which we assess open-ended responses, especially in high-stakes scenarios. Educators take great care to provide unbiased, consistent ratings, but raters may nonetheless be evaluating extraneous features unrelated to those they were intended to rate.