The problem of detecting fake data inspires the following seemingly simple mathematical question. Sample a data point $X$ from the standard normal distribution in $\mathbb{R}^n$. An adversary observes $X$ and corrupts it by adding a vector $rt$, where $t$ may be any vector from a fixed set $T$ of the adversary's "tricks" and $r>0$ is a fixed radius. The adversary's choice of $t=t(X)$ may depend on the true data $X$. The adversary wants to hide the corruption by making the fake data $X+rt$ statistically indistinguishable from the real data $X$. What is the largest radius $r=r(T)$ for which the adversary can create an undetectable fake? We show that for highly symmetric sets $T$, the detectability radius $r(T)$ is approximately twice the scaled Gaussian width of $T$. The upper bound in fact holds for arbitrary sets $T$ and extends to arbitrary, non-Gaussian distributions of the real data $X$. The lower bound may fail for sets $T$ that are not highly symmetric, but we conjecture that this can be remedied by using a focused version of the Gaussian width of $T$, which concentrates on the most important directions of $T$.
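For readers unfamiliar with the Gaussian width mentioned above, the following is a minimal sketch of how it could be estimated numerically for a finite set $T$. It assumes the standard definition $w(T) = \mathbb{E} \sup_{t \in T} \langle g, t \rangle$ with $g$ standard normal in $\mathbb{R}^n$. The abstract does not specify the scaling, so the sketch computes only the unscaled width and makes no claim about $r(T)$ itself. The function name `gaussian_width`, the sample count, and the example set (the signed standard basis vectors, chosen as one illustration of a highly symmetric set) are all assumptions made for illustration.

```python
import numpy as np


def gaussian_width(T, num_samples=20000, seed=0):
    """Monte Carlo estimate of w(T) = E sup_{t in T} <g, t>, with g ~ N(0, I_n).

    T is an array of shape (m, n) holding one vector of the finite set per row.
    This is the unscaled width; any scaling used in the paper is not applied.
    """
    T = np.asarray(T, dtype=float)
    rng = np.random.default_rng(seed)
    # Each row of g is an independent standard normal vector in R^n.
    g = rng.standard_normal((num_samples, T.shape[1]))
    # For every sample, take the largest inner product over T, then average.
    return float((g @ T.T).max(axis=1).mean())


# Hypothetical example set: the 2n signed standard basis vectors in R^n.
n = 50
T_example = np.vstack([np.eye(n), -np.eye(n)])
w_example = gaussian_width(T_example)

# Sanity check: a single-point set has width E<g, t> = 0.
w_single = gaussian_width(np.ones((1, n)) / np.sqrt(n))
```

For this example set the supremum equals $\max_i |g_i|$, so the estimate should stay below the classical bound $\sqrt{2 \log(2n)}$ for the expected maximum of $2n$ standard normal variables.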