Beyond Point Estimates: Toward Proper Statistical Inferencing and Reporting of Intraclass Correlation Coefficients

Reporting test-retest reliability with the intraclass correlation coefficient (ICC) has received increasing attention amid criticism of poor transparency and replicability in neuroimaging and many other areas of biomedical research. Numerous studies have therefore evaluated the reliability of their findings by comparing ICCs; however, they often fail to test whether the ICCs differ statistically or to report confidence intervals. Relying solely on point estimates may preclude valid inference about population-level differences and compromise the reliability of conclusions. To address this issue, we systematically reviewed the use of ICCs in articles published in NeuroImage from 2022 to 2024, highlighting the prevalence of misreporting and misuse. We further provide practical guidelines for conducting appropriate statistical inference on ICCs and, for practitioners in this area, introduce an online application for statistical testing and sample size estimation with ICCs. We recalculated confidence intervals for, and formally tested, the ICC values reported in the reviewed articles, thereby reassessing the original inferences. Our results demonstrate that exclusive reliance on point estimates can lead to unreliable or even misleading conclusions. Specifically, only two of the eleven reviewed articles provided unequivocally valid statistical inferences based on ICCs, whereas two articles yielded no valid inference at all, raising serious concerns about the replicability of findings in this field. These results underscore the urgent need for rigorous inferential frameworks when reporting and interpreting ICCs.
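To illustrate why a point estimate alone is insufficient, the following minimal Python sketch computes a one-way random-effects ICC, often written ICC(1,1), together with a percentile bootstrap confidence interval obtained by resampling subjects. It is an illustration under stated assumptions, not the procedure used in this study or in the online application: the ICC form, the bootstrap (swapped in here for an analytic interval so that only the standard library is needed), the function names `icc_oneway` and `bootstrap_ci`, and the `scores` data are all hypothetical.

```python
import random
from statistics import mean


def icc_oneway(data):
    """One-way random-effects ICC(1,1).

    `data` is a list of per-subject lists, each holding k repeated measurements.
    """
    n = len(data)          # number of subjects
    k = len(data[0])       # measurements per subject
    grand = mean(x for row in data for x in row)
    row_means = [mean(row) for row in data]
    # Between-subject and within-subject mean squares from a one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(data, row_means) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


def bootstrap_ci(data, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the ICC, resampling whole subjects."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        try:
            estimates.append(icc_oneway(sample))
        except ZeroDivisionError:
            # Degenerate resample with no variance at all; skip it.
            continue
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Hypothetical test-retest scores: 10 subjects, 2 sessions each.
scores = [
    [10.1, 9.8], [12.3, 12.9], [8.7, 9.1], [11.5, 11.0], [14.2, 13.6],
    [9.9, 10.4], [13.1, 12.5], [7.8, 8.3], [10.9, 11.4], [12.0, 11.6],
]

estimate = icc_oneway(scores)
ci_low, ci_high = bootstrap_ci(scores)
print(f"ICC = {estimate:.3f}, 95% bootstrap CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

With a small sample such as this one, the interval can be wide enough that two visibly different point estimates are statistically indistinguishable, which is why the interval should be reported alongside the estimate.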
