In truncated linear regression, a sample $(x,y)$ is observed only when the outcome $y$ falls inside a survival set $S^\star$, and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history in Statistics and Machine Learning, going back to the works of Galton (1897) and Tobin (1958), with more recent contributions including (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, need $d^{\mathrm{poly}(1/\varepsilon)}$ run time to achieve $\varepsilon$ accuracy. In this work, we give the first algorithm for truncated linear regression with an unknown survival set that runs in $\mathrm{poly}(d/\varepsilon)$ time while requiring only that the feature vectors be sub-Gaussian. Our algorithm relies on a novel subroutine that efficiently learns unions of a bounded number of intervals from positive examples alone (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of work on positive-only PAC learning and may be of independent interest.
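To make the observation model concrete, the following is a minimal simulation sketch and not the algorithm of this work. It assumes standard Gaussian features, unit-variance Gaussian noise, and a hypothetical survival set made of two intervals; the names `in_survival_set`, `ols_with_intercept`, `w_star`, and `S_star` are illustrative. It only shows that ordinary least squares fitted on the surviving samples is biased, which is why truncation needs dedicated estimators.

```python
import numpy as np

def in_survival_set(y, intervals):
    """Return a boolean mask that is True where y lies in the union of closed intervals."""
    mask = np.zeros(y.shape, dtype=bool)
    for lo, hi in intervals:
        mask |= (y >= lo) & (y <= hi)
    return mask

def ols_with_intercept(X, y):
    """Ordinary least squares with an intercept column; returns the slope vector only."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1]

rng = np.random.default_rng(0)
d, n = 3, 20000
w_star = np.array([1.0, -2.0, 0.5])   # assumed ground-truth regressor
S_star = [(0.5, 2.0), (3.0, 5.0)]     # assumed survival set: a union of two intervals

# Assumed data model: Gaussian features and unit-variance Gaussian noise.
X = rng.standard_normal((n, d))
y = X @ w_star + rng.standard_normal(n)

# Truncation: a sample is observed only if its outcome falls inside S_star.
keep = in_survival_set(y, S_star)
X_obs, y_obs = X[keep], y[keep]

# Compare OLS on the full (never observed in practice) data with OLS on the survivors.
w_full = ols_with_intercept(X, y)
w_trunc = ols_with_intercept(X_obs, y_obs)
err_full = np.linalg.norm(w_full - w_star)
err_trunc = np.linalg.norm(w_trunc - w_star)

print(f"observed {keep.sum()} of {n} samples")
print(f"OLS error without truncation: {err_full:.3f}")
print(f"OLS error on truncated data:  {err_trunc:.3f}")
```

In this sketch the truncated fit shrinks the coefficients toward zero, because conditioning on $y \in S^\star$ reduces the variance of the outcomes that remain. In the setting of this work the learner would additionally not be given `S_star` and would have to infer it from the surviving outcomes alone.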