This paper studies shuffled linear regression, in which the correspondence between predictors and responses in a linear model is obscured by a latent permutation. Specifically, we consider the model $y = \Pi_* X \beta_* + w$, where $X$ is an $n \times d$ standard Gaussian design matrix, $w$ is Gaussian noise with entrywise variance $\sigma^2$, $\Pi_*$ is an unknown $n \times n$ permutation matrix, and $\beta_*$ is the unknown vector of regression coefficients. Previous work has shown that, in the large-$n$ limit, the minimal signal-to-noise ratio $\mathsf{SNR} = \lVert \beta_* \rVert^2/\sigma^2$ required to recover the unknown permutation exactly with high probability lies between $n^2$ and $n^C$ for some absolute constant $C$; the sharp threshold was unknown even for $d=1$. We show that this threshold is precisely $\mathsf{SNR} = n^4$ for exact recovery throughout the sublinear regime $d=o(n)$. As a by-product of our analysis, we also determine the sharp threshold for almost exact recovery, in which all but a vanishing fraction of the permutation is reconstructed, to be $\mathsf{SNR} = n^2$.
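To make the model concrete, the following is a minimal simulation sketch, not the paper's method or analysis. It draws data from $y = \Pi_* X \beta_* + w$ and, for $d=1$ only, estimates the permutation by sorted matching: for a scalar coefficient, minimizing the residual over both the permutation and the coefficient amounts to maximizing $|\langle y, \Pi x \rangle|$, which by the rearrangement inequality is attained by pairing the sorted responses with the sorted predictors in the same or the reverse order. The function names `simulate` and `match_1d`, the normalization $\lVert \beta_* \rVert = 1$, and the representation of $\Pi_*$ as an index array are illustrative assumptions.

```python
import numpy as np


def simulate(n, d, snr, rng):
    """Draw (X, y, perm) from y = Pi X beta + w with ||beta||^2 / sigma^2 = snr.

    Assumption for illustration: beta is normalized to unit norm, so
    sigma = 1 / sqrt(snr). The permutation is stored as an index array,
    with perm[i] the row of X that generates y[i].
    """
    X = rng.standard_normal((n, d))
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)
    sigma = 1.0 / np.sqrt(snr)
    perm = rng.permutation(n)
    y = X[perm] @ beta + sigma * rng.standard_normal(n)
    return X, y, perm


def match_1d(x, y):
    """Estimate the permutation for d = 1 by sorted matching.

    Both orientations are tried because the sign of the scalar
    coefficient is unknown; the one with the larger |<y, x[est]>| wins.
    """
    order_y = np.argsort(y)
    order_x = np.argsort(x)
    best_score, best_est = -np.inf, None
    for candidate in (order_x, order_x[::-1]):
        est = np.empty(len(y), dtype=int)
        # The k-th smallest response is paired with the k-th smallest
        # (or k-th largest) predictor.
        est[order_y] = candidate
        score = abs(y @ x[est])
        if score > best_score:
            best_score, best_est = score, est
    return best_est


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 200
    for exponent in (1, 2, 3, 4, 5):
        X, y, perm = simulate(n, 1, float(n) ** exponent, rng)
        est = match_1d(X[:, 0], y)
        frac = np.mean(est == perm)
        print(f"SNR = n^{exponent}: fraction correctly matched = {frac:.3f}")
```

Running the script prints the fraction of correctly matched indices for several powers of $n$. A single draw at moderate $n$ only gives a rough impression of the asymptotic thresholds stated above; averaging over many trials and larger $n$ would be needed for a meaningful comparison.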