Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions

We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, the noisy labels are of the form $y = g(w^* \cdot x) + \xi + \epsilon$, where $\xi$ is the oblivious noise drawn independently of $x$ and satisfies $\Pr[\xi = 0] \geq o(1)$, and $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Our goal is to accurately recover a parameter vector $w$ such that the function $g(w \cdot x)$ has arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles this problem in its most general distribution-independent setting, where the solution may not even be identifiable. Our algorithm returns an accurate estimate of the solution if it is identifiable, and otherwise returns a small list of candidates, one of which is close to the true solution. Furthermore, we provide a necessary and sufficient condition for identifiability, which holds in broad settings. Specifically, the problem is identifiable when the quantile at which $\xi + \epsilon = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first algorithmic result for GLM regression with oblivious noise that can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression, and gave algorithms under restrictive assumptions.
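
To make the observation model concrete, the following is a minimal sketch that draws samples of the form $y = g(w^* \cdot x) + \xi + \epsilon$. It only illustrates the data model and is not the algorithm of the paper. Several choices are assumptions made purely for illustration: the helper name `sample_glm_oblivious`, the ReLU link for $g$, standard Gaussian covariates (the paper's setting is distribution-independent), the parameter `alpha` standing in for $\Pr[\xi = 0]$, and the uniform distribution used for the nonzero corruptions.

```python
import random


def relu(t):
    """Example link function g (an assumption; any GLM link could be used)."""
    return max(0.0, t)


def sample_glm_oblivious(n, w_star, g=relu, alpha=0.2, sigma=0.1, seed=0):
    """Draw n tuples (x, y, clean, xi) with y = g(w_star . x) + xi + eps.

    xi is oblivious: it is drawn without looking at x, and equals 0 with
    probability alpha. eps is Gaussian with standard deviation sigma.
    """
    rng = random.Random(seed)
    d = len(w_star)
    samples = []
    for _ in range(n):
        # Covariates: Gaussian here only for illustration.
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        clean = g(sum(wi * xi_ for wi, xi_ in zip(w_star, x)))
        # Oblivious corruption: independent of x, zero with probability alpha.
        xi = 0.0 if rng.random() < alpha else rng.uniform(-50.0, 50.0)
        # Additive Gaussian measurement noise.
        eps = rng.gauss(0.0, sigma)
        y = clean + xi + eps
        samples.append((x, y, clean, xi))
    return samples


if __name__ == "__main__":
    data = sample_glm_oblivious(n=5, w_star=[1.0, -2.0])
    for x, y, clean, xi in data:
        print(f"clean={clean:.3f}  xi={xi:.3f}  y={y:.3f}")
```

With `alpha` below one half, most labels in such a sample are corrupted, which is the regime the abstract highlights. The regression target is the `clean` value $g(w^* \cdot x)$, not the observed `y`.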