Inter-rater reliability (IRR), a prerequisite for high-quality ratings and assessments, may be affected by contextual variables such as the rater's or ratee's gender, major, or experience. Identifying such sources of heterogeneity in IRR is important for implementing policies that can decrease measurement error and increase IRR by focusing on the most relevant subgroups. In this study, we propose a flexible approach for assessing IRR in cases of heterogeneity due to covariates by directly modeling differences in variance components. We use Bayes factors to select the best-performing model, and we suggest Bayesian model averaging as an alternative approach for obtaining IRR and variance component estimates, which allows us to account for model uncertainty. We use inclusion Bayes factors, computed over the whole model space, to provide evidence for or against differences in variance components due to covariates. The proposed method is compared with other Bayesian and frequentist approaches in a simulation study, and we demonstrate its superiority in some situations. Finally, we provide real data examples from grant proposal peer review, demonstrating the usefulness of this method and its flexibility in generalizing to more complex designs.
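To make the quantities named above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that IRR is defined as the single-rater intraclass correlation (ratee variance over ratee plus residual variance), and that a log marginal likelihood is already available for each candidate model. The function names (`irr`, `posterior_model_probs`, `inclusion_bayes_factor`, `model_averaged`) are hypothetical and introduced only for illustration.

```python
import math

def irr(var_ratee, var_residual):
    """Assumed IRR definition: share of total variance due to ratees."""
    return var_ratee / (var_ratee + var_residual)

def posterior_model_probs(log_marginals, priors):
    """Posterior model probabilities from log marginal likelihoods and priors."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_marginals)
    weights = [p * math.exp(lm - m) for lm, p in zip(log_marginals, priors)]
    total = sum(weights)
    return [w / total for w in weights]

def inclusion_bayes_factor(post, priors, includes):
    """Posterior inclusion odds divided by prior inclusion odds.

    `includes[i]` is True when model i lets the covariate change the
    variance component of interest.
    """
    post_in = sum(p for p, inc in zip(post, includes) if inc)
    prior_in = sum(p for p, inc in zip(priors, includes) if inc)
    return (post_in / (1.0 - post_in)) / (prior_in / (1.0 - prior_in))

def model_averaged(estimates, post):
    """Average per-model estimates, weighted by posterior model probability."""
    return sum(e * p for e, p in zip(estimates, post))
```

Under these assumptions, an inclusion Bayes factor above 1 indicates that the data shifted belief toward models in which the covariate affects the variance component, and a value below 1 indicates a shift away from them. The model-averaged IRR weights each model's IRR estimate by that model's posterior probability instead of conditioning on a single selected model.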