A number of information retrieval studies have assessed which statistical techniques are appropriate for comparing systems. However, these studies focus on TREC-style experiments, which typically have fewer than 100 topics. There is no similar line of work for large search and recommendation experiments; such studies typically have thousands of topics or users and much sparser relevance judgements, so it is not clear whether recommendations for analyzing traditional TREC experiments apply to these settings. In this paper, we empirically study the behavior of significance tests on large search and recommendation evaluation data. Our results show that the Wilcoxon and sign tests have significantly higher Type-1 error rates at large sample sizes than the bootstrap, randomization, and t-tests, whose error rates were more consistent with the expected rate. While the tests differed in power at smaller sample sizes, they showed no difference in power at large sample sizes. We therefore recommend that the sign and Wilcoxon tests not be used to analyze large-scale evaluation results. Our results also demonstrate that with Top-N recommendation and large search evaluation data, most tests would have a 100% chance of finding statistically significant results. Therefore, the effect size should be used to determine practical or scientific significance.
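The final point, that a significant p-value at large sample sizes says little about practical importance, can be illustrated with a minimal sketch. It assumes synthetic per-user metric differences drawn from a Gaussian (the mean of 0.03 and standard deviation of 0.3 are illustrative choices, not values from the paper), and the helper names `paired_randomization_test` and `paired_effect_size` are hypothetical. It uses a sign-flip randomization test as one representative of the tests discussed above; it is not the experimental procedure used in this work.

```python
import random
import statistics


def paired_randomization_test(diffs, n_resamples=500, rng=None):
    """Two-sided sign-flip randomization test on paired per-user differences."""
    rng = rng or random.Random(0)
    observed = abs(statistics.fmean(diffs))
    n = len(diffs)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, the system labels are exchangeable, so each
        # difference is equally likely to have either sign.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / n) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)


def paired_effect_size(diffs):
    """Standardized mean difference (Cohen's d for paired samples)."""
    return statistics.fmean(diffs) / statistics.stdev(diffs)


rng = random.Random(42)
# Synthetic per-user metric differences (system A minus system B);
# the mean and spread are illustrative assumptions, not data from the paper.
diffs = [rng.gauss(0.03, 0.3) for _ in range(5000)]

p_value = paired_randomization_test(diffs, rng=rng)
effect = paired_effect_size(diffs)
print(f"p = {p_value:.4f}, effect size d = {effect:.3f}")
```

With several thousand users, a difference this small should be flagged as statistically significant while the standardized effect size stays well below the conventional threshold of 0.2 for a "small" effect, which is why the effect size is the more informative quantity at this scale.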