In the big data era, traditional statistical methods need to be reevaluated because very large datasets pose new challenges. In theory, larger samples improve estimation accuracy and the power of hypothesis tests without increasing the false positive rate, yet practical concerns about inflated Type I error persist. The prevailing view is that large samples can detect subtle effects, so both the p-value and the effect size should be considered. Nevertheless, the reliability of p-values from large samples remains debated. This paper warns that larger samples can magnify minor problems into substantial errors and lead to false conclusions. In a simulation study, we show how growing sample sizes amplify the consequences of two violations of model assumptions that are common in real-world data, resulting in incorrect decisions. This underscores the need for vigilant analytical approaches in the era of big data. In response, we introduce a permutation-based test that counterbalances the effects of sample size and assumption violations by neutralizing them between the actual and permuted data. We demonstrate that this approach keeps Type I error rates at the nominal level across a range of sample sizes, supporting robust statistical inference in big data even when conventional assumptions are violated. For reproducibility, our R code is publicly available at: \url{https://github.com/ubcxzhang/bigDataIssue}.
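To make the idea of a permutation-based test concrete, the following is a minimal sketch of a generic two-sample permutation test on the difference in means. It is not the procedure proposed in this paper, whose details are in the linked repository; the function names `mean_diff` and `permutation_pvalue`, the choice of test statistic, and the defaults `n_perm=2000` and `seed=0` are all assumptions made for illustration.

```python
import random


def mean_diff(a, b):
    """Difference between the sample means of two sequences."""
    return sum(a) / len(a) - sum(b) / len(b)


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in means.

    Illustrative sketch only: the statistic and defaults are assumptions,
    not the method described in the paper.
    """
    rng = random.Random(seed)
    observed = abs(mean_diff(group_a, group_b))

    # Pool the observations; under the null hypothesis the group labels
    # are exchangeable, so any reassignment is equally likely.
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        permuted = abs(mean_diff(pooled[:n_a], pooled[n_a:]))
        if permuted >= observed:
            exceed += 1

    # Add one to numerator and denominator so the p-value is never zero.
    return (exceed + 1) / (n_perm + 1)
```

Because the reference distribution is built from the data themselves, the permuted statistics carry the same sample size and the same distributional quirks as the observed statistic, which is the general intuition behind using permutations when standard assumptions are in doubt.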