We study the sublinear multivariate mean estimation problem in $d$-dimensional Euclidean space. Specifically, we aim to find the mean $\mu$ of a ground point set $A$, which minimizes the sum of squared Euclidean distances of the points in $A$ to $\mu$. We first show that a multiplicative $(1+\varepsilon)$-approximation to $\mu$ can be found with probability $1-\delta$ using $O(\varepsilon^{-1}\log \delta^{-1})$ independent uniform random samples, and we provide a matching lower bound. Furthermore, we give two sublinear-time algorithms of optimal sample complexity for extracting a suitable approximate mean: 1. Our first algorithm is based on gradient descent and exploits properties of the geometric median to estimate the mean. It runs in time $O((\varepsilon^{-1}+\log \delta^{-1})\cdot \log \delta^{-1} \cdot d)$. 2. Our second algorithm leverages properties of order statistics of empirical means, as well as clustering, to estimate the mean. This reduces the running time to near-optimal, namely $O\left((\varepsilon^{-1}+\log^{\gamma}\delta^{-1})\cdot \log \delta^{-1} \cdot d\right)$ for any constant $\gamma>0$. As part of our analysis, we also generalize the familiar median-of-means estimator to the multivariate case, showing that the geometric median of $\log \delta^{-1}$ empirical means estimates the mean $\mu$ well, which may be of independent interest.
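The multivariate median-of-means idea described above can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the geometric median is approximated here with Weiszfeld's iteration rather than the gradient-descent or clustering procedures the abstract mentions, and the constants hidden in the $O(\cdot)$ bounds are assumed to be 1. The function names (`empirical_mean`, `geometric_median`, `median_of_means`) and the parameters `iters` and `tol` are hypothetical.

```python
import math
import random


def empirical_mean(points):
    """Coordinate-wise average of a non-empty list of equal-length vectors."""
    n = len(points)
    d = len(points[0])
    return [sum(p[j] for p in points) / n for j in range(d)]


def geometric_median(points, iters=100, tol=1e-9):
    """Approximate the geometric median with Weiszfeld's iteration.

    Assumption: this is a stand-in for the procedures in the paper.
    """
    current = empirical_mean(points)  # starting guess
    d = len(current)
    for _ in range(iters):
        weights = []
        for p in points:
            dist = math.dist(p, current)
            if dist < tol:
                # The iterate sits (numerically) on an input point;
                # the sketch simply stops here instead of handling this case.
                return current
            weights.append(1.0 / dist)
        total = sum(weights)
        # Re-weighted average: closer points receive larger weights.
        updated = [
            sum(w * p[j] for w, p in zip(weights, points)) / total
            for j in range(d)
        ]
        if math.dist(updated, current) < tol:
            return updated
        current = updated
    return current


def median_of_means(ground_set, eps, delta, rng):
    """Geometric median of about log(1/delta) empirical means,
    each computed from about 1/eps uniform samples (constants assumed to be 1)."""
    num_batches = max(1, math.ceil(math.log(1.0 / delta)))
    batch_size = max(1, math.ceil(1.0 / eps))
    means = []
    for _ in range(num_batches):
        # Independent uniform samples with replacement from the ground set.
        batch = [rng.choice(ground_set) for _ in range(batch_size)]
        means.append(empirical_mean(batch))
    return geometric_median(means)
```

A call such as `median_of_means(A, eps=0.1, delta=0.05, rng=random.Random(0))` draws about $\varepsilon^{-1}\log \delta^{-1}$ samples in total, matching the sample complexity stated above up to the assumed constants. The sketch makes no claim about matching the stated running times.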