Optimal estimation of the null distribution in large-scale inference

The advent of large-scale inference has spurred reexamination of conventional statistical thinking. In a Gaussian model for $n$ $z$-scores with at most $k < \frac{n}{2}$ nonnulls, Efron suggests estimating the location and scale parameters of the null distribution. Placing no assumptions on the nonnull effects, the statistical task can be viewed as a robust estimation problem. However, the best known robust estimators fail to be consistent in the regime $k \asymp n$, which is especially relevant in large-scale inference. The failure of estimators which are minimax rate-optimal with respect to other formulations of robustness (e.g., Huber's contamination model) might suggest the impossibility of consistent estimation in this regime and, consequently, a major weakness of Efron's suggestion. A sound evaluation of Efron's model thus requires a complete understanding of consistency. We sharply characterize the regime of $k$ for which consistent estimation is possible and further establish the minimax estimation rates. We show that consistent estimation of the location parameter is possible if and only if $\frac{n}{2} - k = \omega(\sqrt{n})$, and that consistent estimation of the scale parameter is possible in the entire regime $k < \frac{n}{2}$. Faster rates than those in Huber's contamination model are achievable by exploiting the Gaussian character of the data. The minimax upper bound is obtained by considering estimators based on the empirical characteristic function. The minimax lower bound involves constructing two marginal distributions whose characteristic functions match on a wide interval containing zero. The construction notably differs from those in the literature by sharply capturing a scaling of $n-2k$ in the minimax estimation rate of the location.
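To make the setting concrete, the following Python sketch simulates $z$-scores with $k \asymp n$ nonnulls and shows that the sample median, a standard robust location estimator, keeps a nonvanishing bias. It also computes the empirical characteristic function, the basic ingredient named above. Everything specific here is an illustrative assumption and not taken from the paper: the helper names `simulate_z_scores` and `empirical_cf`, the choice of nonnull effects (all equal to a common `shift`), and the parameter values. It is not the estimator analyzed in the paper.

```python
import numpy as np
from statistics import NormalDist


def simulate_z_scores(n, k, theta=0.0, sigma=1.0, shift=10.0, seed=0):
    """Draw n Gaussian z-scores; the first k are nonnull, shifted by `shift`.

    Assumption for illustration: every nonnull effect equals `shift`.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(theta, sigma, size=n)
    z[:k] += shift
    return z


def empirical_cf(z, t):
    """Empirical characteristic function (1/n) * sum_j exp(i * t * z_j)."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    return np.exp(1j * np.outer(t, z)).mean(axis=1)


n, k = 100_000, 25_000  # k is a constant fraction of n
z = simulate_z_scores(n, k)

# True null location is theta = 0, so the median itself is the bias.
median_bias = float(np.median(z))

# With all nonnulls far to the right, the overall median sits near the
# n / (2 (n - k)) quantile of the null distribution instead of its median.
predicted_bias = NormalDist().inv_cdf(0.5 * n / (n - k))

# For purely null N(0, 1) data, |ECF(t)| should be close to exp(-t^2 / 2).
null_modulus = abs(empirical_cf(simulate_z_scores(n, 0), 1.0)[0])

print(median_bias, predicted_bias, null_modulus)
```

In this one-sided configuration the bias of the median does not shrink as $n$ grows with $k/n$ fixed, which is the kind of inconsistency the abstract attributes to standard robust estimators when $k \asymp n$.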
