We demonstrate the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume we have sample access to examples $(x, y)$ where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, the noisy labels are of the form $y = g(w^* \cdot x) + \xi + \epsilon$, where $\xi$ is the oblivious noise drawn independently of $x$ and satisfies $\Pr[\xi = 0] \geq o(1)$, and $\epsilon \sim \mathcal N(0, \sigma^2)$. Our goal is to accurately recover a parameter vector $w$ such that the function $g(w \cdot x)$ has arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than the noisy measurements $y$. We present an algorithm that tackles this problem in its most general distribution-independent setting, where the solution may not even be identifiable. Our algorithm returns an accurate estimate of the solution if it is identifiable, and otherwise returns a small list of candidates, one of which is close to the true solution. Furthermore, we provide a necessary and sufficient condition for identifiability, which holds in broad settings. Specifically, the problem is identifiable when the quantile at which $\xi + \epsilon = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$, while also having large error when compared to $g(w^* \cdot x)$. This is the first algorithmic result for GLM regression with oblivious noise which can handle more than half the samples being arbitrarily corrupted. Prior work focused largely on the setting of linear regression, and gave algorithms under restrictive assumptions.
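To make the noise model concrete, the following is a minimal simulation sketch of the label-generating process $y = g(w^* \cdot x) + \xi + \epsilon$. It is not the algorithm of this work. Everything beyond that formula is an assumption chosen for illustration: the function name `sample_oblivious_glm`, the ReLU choice of $g$, the Gaussian distribution of $x$, the uniform distribution of the nonzero values of $\xi$, and the parameter `alpha` standing in for $\Pr[\xi = 0]$.

```python
import random


def relu(t):
    """Illustrative choice of the link function g (an assumption)."""
    return max(0.0, t)


def sample_oblivious_glm(w_star, n, alpha=0.2, sigma=0.1, g=relu, seed=0):
    """Draw n samples (x, y, clean) with y = g(w* . x) + xi + eps.

    xi is oblivious: it is drawn without looking at x, and equals 0
    with probability alpha. The uniform corruption range and the
    Gaussian covariates are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    d = len(w_star)
    samples = []
    for _ in range(n):
        # Covariates: standard Gaussian here, though the setting is
        # distribution-independent.
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        clean = g(sum(wi * xi for wi, xi in zip(w_star, x)))
        # Oblivious noise: zero with probability alpha, otherwise an
        # arbitrary value that does not depend on x.
        xi = 0.0 if rng.random() < alpha else rng.uniform(-50.0, 50.0)
        # Additive Gaussian measurement noise with standard deviation sigma.
        eps = rng.gauss(0.0, sigma)
        samples.append((x, clean + xi + eps, clean))
    return samples


if __name__ == "__main__":
    data = sample_oblivious_glm([1.0, -2.0], n=1000, alpha=0.2)
    near_clean = sum(abs(y - clean) <= 1.0 for _, y, clean in data)
    print("fraction of labels within 1.0 of g(w* . x):", near_clean / len(data))
```

With `alpha` well below one half, most labels in such a sample are far from $g(w^* \cdot x)$, which is the regime of more than half the samples being corrupted.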