Online Robust Mean Estimation

We study the problem of high-dimensional robust mean estimation in an online setting. Specifically, we consider a scenario where $n$ sensors are measuring some common, ongoing phenomenon. At each time step $t=1,2,\ldots,T$, the $i$-th sensor reports its readings $x^{(i)}_t$ for that time step. The algorithm must then commit to its estimate $\mu_t$ of the true mean value of the process at time $t$. We assume that most of the sensors observe independent samples from some common distribution $X$, but an $\epsilon$-fraction of them may instead behave maliciously. The algorithm wishes to compute a good approximation $\mu$ to the true mean $\mu^\ast := \mathbf{E}[X]$. If the algorithm is allowed to wait until time $T$ to report its estimate, this reduces to the well-studied problem of robust mean estimation. However, the requirement that the algorithm produce partial estimates as the data comes in substantially complicates the situation. We prove two main results about online robust mean estimation in this model. First, if the uncorrupted samples satisfy the standard condition of $(\epsilon,\delta)$-stability, we give an efficient online algorithm that outputs estimates $\mu_t$, $t \in [T]$, such that with high probability $\|\mu-\mu^\ast\|_2 = O(\delta \log(T))$, where $\mu = (\mu_t)_{t \in [T]}$. This error bound is within a $\log(T)$ factor of the best offline algorithms, which achieve $\ell_2$-error $O(\delta)$. Our second main result shows that under additional assumptions on the input (most notably that $X$ is a product distribution), there are inefficient algorithms whose error does not depend on $T$ at all.
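To make the online protocol concrete, the following is a minimal Python sketch of the setting only. It is not the algorithm from this work. It assumes each reading $x^{(i)}_t$ lies in $\mathbb{R}^d$, that clean sensors add standard Gaussian noise to the true mean at time $t$, and that corrupted sensors report a fixed shifted value (a deliberately crude, hypothetical adversary). The estimator is a naive per-step coordinate-wise median, used purely as a baseline. All function names and parameters are illustrative.

```python
import numpy as np


def simulate_readings(n, T, d, eps, true_means, rng):
    """Generate readings of shape (T, n, d) plus a mask of corrupted sensors.

    Assumption: clean sensors observe true_means[t] + N(0, I) noise.
    Assumption: the first floor(eps * n) sensors are corrupted and always
    report true_means[t] + 5 in every coordinate (a toy adversary).
    """
    readings = true_means[:, None, :] + rng.standard_normal((T, n, d))
    bad = np.zeros(n, dtype=bool)
    bad[: int(eps * n)] = True
    readings[:, bad, :] = true_means[:, None, :] + 5.0
    return readings, bad


def online_median_estimates(readings):
    """Commit to an estimate mu_t at each step using only data seen at step t.

    Baseline only: coordinate-wise median across sensors at each time step.
    """
    T, n, d = readings.shape
    estimates = np.empty((T, d))
    for t in range(T):
        # The estimate for step t is fixed before step t + 1 is observed.
        estimates[t] = np.median(readings[t], axis=0)
    return estimates


def total_l2_error(estimates, true_means):
    """l2 norm of the concatenated error vector (mu_t - mu*_t) over all t."""
    return float(np.linalg.norm(estimates - true_means))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, T, d, eps = 200, 50, 5, 0.1
    true_means = np.zeros((T, d))
    readings, _ = simulate_readings(n, T, d, eps, true_means, rng)
    estimates = online_median_estimates(readings)
    print("median baseline error:", total_l2_error(estimates, true_means))
    print("plain mean error:", total_l2_error(readings.mean(axis=1), true_means))
```

This baseline treats every time step independently and ignores which sensor produced which reading, so it does not exploit the fact that the same $\epsilon$-fraction of sensors is corrupted throughout. It is meant only to show the interface an online estimator must satisfy: one committed output per time step, with error measured over the whole sequence.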
