A number of information retrieval studies have assessed which statistical techniques are appropriate for comparing systems. However, these studies focus on TREC-style experiments, which typically have fewer than 100 topics. There is no similar line of work for large search and recommendation experiments; such studies typically have thousands of topics or users and much sparser relevance judgements, so it is not clear whether recommendations for analyzing traditional TREC experiments apply to these settings. In this paper, we empirically study the behavior of significance tests with large search and recommendation evaluation data. Our results show that the Wilcoxon and sign tests have significantly higher Type-1 error rates at large sample sizes than the bootstrap, randomization, and t-tests, which were more consistent with the expected error rate. While the statistical tests differed in power at smaller sample sizes, they showed no difference in power at large sample sizes. We recommend that the sign and Wilcoxon tests not be used to analyze large-scale evaluation results. Our results also demonstrate that with Top-N recommendation and large search evaluation data, most tests would have a 100% chance of finding statistically significant results. Therefore, the effect size should be used to determine practical or scientific significance.
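The closing point about effect size can be illustrated with a small simulation. The sketch below is not the paper's experimental setup: it assumes Gaussian per-topic score differences (real evaluation metrics are typically sparser and more skewed), and the names `paired_t_and_effect_size` and `simulate`, as well as the values of `true_shift` and `noise_sd`, are hypothetical choices made for illustration. It shows that for a fixed, small standardized effect, the paired t statistic grows with the square root of the sample size, so a large enough experiment will flag even a negligible difference as significant.

```python
import math
import random
import statistics


def paired_t_and_effect_size(diffs):
    """Return (t statistic, Cohen's d) for a list of paired score differences."""
    n = len(diffs)
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)   # sample standard deviation of the differences
    d = mean / sd                  # standardized effect size for paired data
    t = d * math.sqrt(n)           # same as mean / (sd / sqrt(n))
    return t, d


def simulate(n, true_shift=0.01, noise_sd=0.2, seed=0):
    """Draw n hypothetical per-topic differences and summarize them.

    true_shift and noise_sd are illustrative values, not taken from the paper.
    """
    rng = random.Random(seed)
    diffs = [rng.gauss(true_shift, noise_sd) for _ in range(n)]
    return paired_t_and_effect_size(diffs)


if __name__ == "__main__":
    # The underlying effect is identical in every row; only n changes.
    for n in (50, 5_000, 200_000):
        t, d = simulate(n)
        print(f"n={n:>7}  t={t:7.2f}  d={d:6.3f}")
```

Under these assumptions the true effect size is 0.05 in every row, while the t statistic at the largest sample size lands far beyond the usual 1.96 threshold. A reader deciding whether a difference matters in practice would therefore look at `d`, or at the raw metric difference, rather than at the significance test alone.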