We study the fundamental problem of estimating the mean of a $d$-dimensional distribution with covariance $\Sigma \preccurlyeq \sigma^2 I_d$ given $n$ samples. When $d = 1$, Catoni \cite{catoni} gave an estimator with error $(1+o(1)) \cdot \sigma \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$ with probability $1 - \delta$, matching the Gaussian error rate. For $d>1$, a natural estimator outputs the center of the minimum enclosing ball of one-dimensional confidence intervals, achieving a $1-\delta$ confidence radius of $\sqrt{\frac{2 d}{d+1}} \cdot \sigma \left(\sqrt{\frac{d}{n}} + \sqrt{\frac{2 \log \frac{1}{\delta}}{n}}\right)$, which incurs a $\sqrt{\frac{2d}{d+1}}$-factor loss over the Gaussian rate. When the $\sqrt{\frac{d}{n}}$ term dominates by a $\sqrt{\log \frac{1}{\delta}}$ factor, \cite{lee2022optimal-highdim} gave an improved estimator matching the Gaussian rate. This raises two natural questions: is the Gaussian rate achievable in general? Or is the $\sqrt{\frac{2 d}{d+1}}$ loss \emph{necessary} when the $\sqrt{\frac{2 \log \frac{1}{\delta}}{n}}$ term dominates? We show that the answer to both questions is \emph{no}: \emph{some} constant-factor loss over the Gaussian rate is necessary, but we construct an estimator that improves over the above naive estimator by a constant factor. We also consider robust estimation, where an adversary may corrupt an $\epsilon$-fraction of the samples arbitrarily; in this case, we show that the above strategy of combining one-dimensional estimates and incurring the $\sqrt{\frac{2d}{d+1}}$-factor loss \emph{is} optimal in the infinite-sample limit.
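
For orientation, the loss factor of the naive estimator can be evaluated at a few dimensions. This is a worked illustration of the expression $\sqrt{\frac{2d}{d+1}}$ stated above, not an additional result:
\[
\left.\sqrt{\frac{2d}{d+1}}\,\right|_{d=1} = 1,
\qquad
\left.\sqrt{\frac{2d}{d+1}}\,\right|_{d=2} = \sqrt{\tfrac{4}{3}} \approx 1.155,
\qquad
\lim_{d \to \infty} \sqrt{\frac{2d}{d+1}} = \sqrt{2} \approx 1.414.
\]
Since $\frac{2d}{d+1} = 2 - \frac{2}{d+1}$ is increasing in $d$, the factor is trivial at $d = 1$ and never exceeds $\sqrt{2}$.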