GLM Regression with Oblivious Corruptions

We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, the noisy labels are of the form $y = g(w^* \cdot x) + \xi + \epsilon$, where $\xi$ is the oblivious noise drawn independently of $x$ and satisfies $\Pr[\xi = 0] \geq o(1)$, and $\epsilon \sim \mathcal N(0, \sigma^2)$. Our goal is to accurately recover a parameter vector $w$ such that the function $g(w \cdot x)$ has arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles this problem in its most general distribution-independent setting, where the solution may not even be identifiable. Our algorithm returns an accurate estimate of the solution if it is identifiable, and otherwise returns a small list of candidates, one of which is close to the true solution. Furthermore, we provide a necessary and sufficient condition for identifiability, which holds in broad settings. Specifically, the problem is identifiable when the quantile at which $\xi + \epsilon = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first algorithmic result for GLM regression with oblivious noise which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression, and gave algorithms under restrictive assumptions.
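To make the observation model concrete, the following is a minimal simulation sketch of $y = g(w^* \cdot x) + \xi + \epsilon$; it is not the algorithm of this work. Everything specific in it is an assumption chosen for illustration: the ReLU link for $g$, Gaussian covariates, a uniform distribution for the nonzero values of $\xi$, the value `alpha = 0.3` for $\Pr[\xi = 0]$, mean absolute error as the error measure, and the hypothetical helper names `sample_oblivious_glm` and `clean_error`.

```python
import random


def relu(t):
    # Assumed link function g; the setting allows other GLM links.
    return max(0.0, t)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sample_oblivious_glm(w_star, n, alpha=0.3, sigma=0.1, g=relu, seed=0):
    """Draw n triples (x, y, clean) with y = g(w* . x) + xi + eps.

    xi is oblivious: it is drawn without looking at x, and equals 0
    with probability alpha. The uniform range for nonzero xi is an
    arbitrary illustrative choice.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in w_star]
        clean = g(dot(w_star, x))
        xi = 0.0 if rng.random() < alpha else rng.uniform(-50.0, 50.0)
        eps = rng.gauss(0.0, sigma)
        data.append((x, clean + xi + eps, clean))
    return data


def clean_error(w, data, g=relu):
    """Mean absolute error against the clean values g(w* . x), not against y."""
    return sum(abs(g(dot(w, x)) - clean) for x, _, clean in data) / len(data)


if __name__ == "__main__":
    w_star = [1.0, -2.0]
    data = sample_oblivious_glm(w_star, n=2000)
    print("error of w* against clean values:", clean_error(w_star, data))
    print("error of the zero vector:", clean_error([0.0, 0.0], data))
```

With `alpha = 0.3`, roughly 70% of the labels carry a large corruption, which is the regime of more than half the samples being corrupted. The error is measured against the clean values $g(w^* \cdot x)$, matching the stated goal, and those values are available here only because the data is synthetic.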
