Comparing statistical likelihoods with diagnostic probabilities based on directly observed proportions to help understand and perhaps overcome the replication crisis

October 4, 2024


Huw Llewelyn

From arXiv: 11 pages, 2 figures, 1 table.

Diagnosticians use an observed proportion as a direct estimate of the posterior probability of a diagnosis. A diagnostician therefore regards a continuous Gaussian distribution of possible numerical outcomes, conditional on the information in the study methods and data, as a distribution of probabilities (not likelihoods). Similarly, they might regard the distribution of possible means based on an SEM as a posterior probability distribution too. If the converse likelihood distribution of the observed mean conditional on any hypothetical mean (e.g. the null hypothesis) is assumed to be the same as the above posterior distribution (as is customary), then by Bayes' rule the prior distribution of all possible hypothetical means is uniform. It follows that the probability Q of any theoretically true mean falling into a tail beyond a null hypothesis is equal to that tail's area as a proportion of the whole. It also follows that the P value (the probability of the observed mean or something more extreme conditional on the null hypothesis) is equal to Q. Replication involves doing two independent studies, thus doubling the variance for the combined posterior probability distribution. So, if the original effect size was 1.96, the number of observations was 100, the SEM was 1 and the original (one-sided) P value was 0.025, the theoretical probability of a replicating study getting a P value of up to 0.025 again is only 0.283. By applying this double variance to achieve a power of 80%, the required number of observations is doubled compared to conventional approaches. If a replicating study is to achieve a P value of up to 0.025 yet again with a probability of 0.8, then this requires 3 times as many observations in the power calculation. This might explain the replication crisis.
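The arithmetic behind the 0.283 figure and the doubled sample size can be checked with a short sketch. It assumes one particular reading of the abstract: the original effect is rescaled by the square root of the doubled variance and then compared with the one-sided critical value, giving a replication probability of Φ(z/√2 − 1.96). This formula and the function names below are illustrative assumptions that happen to be consistent with the quoted numbers; they are not taken from the abstract itself, and the threefold claim is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def replication_probability(z_original, alpha_one_sided=0.025, variance_factor=2.0):
    """Probability that a replicating study reaches P <= alpha (one-sided).

    ASSUMPTION (not stated in the abstract): the original effect, measured
    in SEM units, is divided by sqrt(variance_factor) to reflect the
    combined variance of two independent studies, and the result is then
    compared with the one-sided critical value.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha_one_sided)
    return STD_NORMAL.cdf(z_original / sqrt(variance_factor) - z_crit)


def required_effect_in_sem_units(power=0.8, alpha_one_sided=0.025, variance_factor=1.0):
    """Effect size (in single-study SEM units) needed to reach the given power.

    variance_factor=1.0 is the conventional power calculation;
    variance_factor=2.0 applies the doubled variance (same assumption as above).
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha_one_sided)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_crit + z_power) * sqrt(variance_factor)


def sample_size_ratio(power=0.8, alpha_one_sided=0.025, variance_factor=2.0):
    """Ratio of required observations: inflated variance vs. conventional.

    The SEM shrinks as 1/sqrt(n), so the required n scales with the square
    of the effect needed in SEM units.
    """
    inflated = required_effect_in_sem_units(power, alpha_one_sided, variance_factor)
    conventional = required_effect_in_sem_units(power, alpha_one_sided, 1.0)
    return (inflated / conventional) ** 2


if __name__ == "__main__":
    print(round(replication_probability(1.96), 3))  # close to the 0.283 quoted above
    print(round(sample_size_ratio(), 3))            # doubled variance doubles n
```

Under this reading, the doubling of the required sample size follows directly from the doubled variance, independently of the chosen power or significance level.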
