We give the first algorithms for regression for generalized linear models (GLMs) in the presence of additive oblivious noise. We assume sample access to examples $(x, y)$, where $y$ is a noisy measurement of $g(w^* \cdot x)$. In particular, the noisy labels are of the form $y = g(w^* \cdot x) + \xi + \epsilon$, where $\xi$ is the oblivious noise, drawn independently of $x$ and satisfying $\Pr[\xi = 0] \geq o(1)$, and $\epsilon \sim \mathcal N(0, \sigma^2)$. Our goal is to recover a parameter vector $w$ such that the function $g(w \cdot x)$ has arbitrarily small error when compared to the true values $g(w^* \cdot x)$, rather than to the noisy measurements $y$. We present an algorithm that tackles this problem in its most general distribution-independent setting, where the solution may not even be identifiable. Our algorithm returns an accurate estimate of the solution if it is identifiable, and otherwise returns a small list of candidates, one of which is close to the true solution. Furthermore, we provide a necessary and sufficient condition for identifiability, which holds in broad settings. Specifically, the problem is identifiable when the quantile at which $\xi + \epsilon = 0$ is known, or when the family of hypotheses does not contain candidates that are nearly equal to a translated $g(w^* \cdot x) + A$ for some real number $A$ while also having large error when compared to $g(w^* \cdot x)$. This is the first algorithmic result for GLM regression with oblivious noise that can handle more than half of the samples being arbitrarily corrupted. Prior work focused largely on linear regression and gave algorithms under restrictive assumptions.
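To make the noise model concrete, the following is a minimal simulation sketch of the label-generating process $y = g(w^* \cdot x) + \xi + \epsilon$; it is not the algorithm of this work. Everything beyond the model itself is an illustrative assumption: the helper names (`sample_oblivious_glm`, `clean_error`), the choice of ReLU for $g$, Gaussian covariates, uniformly distributed nonzero values of $\xi$, and mean absolute error as the error measure.

```python
import random


def relu(t):
    """Example link function g (an assumption; any GLM link could be used)."""
    return max(0.0, t)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_oblivious_glm(n, w_star, g, alpha, sigma, rng):
    """Draw n triples (x, y, clean) with y = g(w_star . x) + xi + eps."""
    samples = []
    for _ in range(n):
        # Assumption: standard Gaussian covariates (the setting is distribution-independent).
        x = [rng.gauss(0.0, 1.0) for _ in w_star]
        clean = g(dot(w_star, x))
        # Oblivious noise: drawn without looking at x, and equal to 0 with probability alpha.
        # Assumption: nonzero values are uniform on [-50, 50].
        xi = 0.0 if rng.random() < alpha else rng.uniform(-50.0, 50.0)
        eps = rng.gauss(0.0, sigma)  # Gaussian measurement noise N(0, sigma^2)
        samples.append((x, clean + xi + eps, clean))
    return samples


def clean_error(w, samples, g):
    """Mean absolute error of g(w . x) against the true values g(w_star . x), not against y."""
    return sum(abs(g(dot(w, x)) - clean) for x, _, clean in samples) / len(samples)


rng = random.Random(0)
w_star = [1.0, -2.0, 0.5]
samples = sample_oblivious_glm(2000, w_star, relu, alpha=0.3, sigma=0.1, rng=rng)
print("error of w_star against clean values:", clean_error(w_star, samples, relu))
print("error of the zero vector:", clean_error([0.0, 0.0, 0.0], samples, relu))
```

In this sketch only about 30% of the labels are free of oblivious noise, so more than half of the samples are corrupted, yet the error is always measured against the clean values $g(w^* \cdot x)$.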