Multiple-choice exams are widely used to assess candidates across a diverse range of domains and tasks. To moderate question quality, newly proposed questions often pass through pre-test evaluation stages before being deployed in real-world exams. Currently, this evaluation process is manually intensive, which can lead to time lags in the question development cycle. Automating this process could significantly improve efficiency; however, there is currently a lack of datasets with adequate pre-test analysis information. In this paper we introduce CamChoice, a multiple-choice comprehension dataset of questions at different target levels, with corresponding candidate selection distributions. We introduce the task of candidate distribution matching, propose several evaluation metrics for the task, and demonstrate that automatic systems trained on RACE++ can be leveraged as baselines for our task. We further demonstrate that these automatic systems can be used for practical pre-test evaluation tasks such as detecting underperforming distractors, where our detection systems can automatically identify poor distractors that few candidates select. We release the data publicly for future research.
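To make the candidate distribution matching task concrete, the following is a minimal sketch of how a predicted selection distribution over answer options might be compared with the observed one, and how rarely selected distractors might be flagged. The abstract does not name the proposed metrics, so the KL divergence and total variation distance used here are illustrative assumptions, as are the function names, the example numbers, and the 5% selection threshold.

```python
import math


def kl_divergence(true_dist, pred_dist, eps=1e-12):
    """KL(true || pred) over answer options; eps guards against log(0).

    Illustrative metric only; not necessarily one used in the paper.
    """
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(true_dist, pred_dist)
        if p > 0
    )


def total_variation(true_dist, pred_dist):
    """Half the L1 distance between two selection distributions."""
    return 0.5 * sum(abs(p - q) for p, q in zip(true_dist, pred_dist))


def flag_weak_distractors(dist, correct_index, threshold=0.05):
    """Indices of incorrect options chosen by fewer than `threshold` of candidates.

    The threshold value is a hypothetical choice for illustration.
    """
    return [
        i for i, p in enumerate(dist)
        if i != correct_index and p < threshold
    ]


# Made-up example: four options, option 0 is the correct answer.
observed = [0.70, 0.20, 0.08, 0.02]   # fraction of candidates per option
predicted = [0.60, 0.20, 0.10, 0.10]  # a system's predicted distribution

print(kl_divergence(observed, predicted))
print(total_variation(observed, predicted))
print(flag_weak_distractors(observed, correct_index=0))
```

In this sketch the same flagging function can be applied to either the observed or the predicted distribution, which mirrors the idea of using an automatic system to identify poor distractors before human pre-testing.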