Understanding Learning with Sliced-Wasserstein Requires Rethinking Informative Slices

The practical applications of Wasserstein distances (WDs) are constrained by their sample and computational complexities. Sliced-Wasserstein distances (SWDs) provide a workaround by projecting distributions onto one-dimensional subspaces, leveraging the more efficient, closed-form WDs for one-dimensional distributions. However, in high dimensions, most random projections become uninformative due to the concentration of measure phenomenon. Although several SWD variants have been proposed to focus on \textit{informative} slices, they often introduce additional complexity and numerical instability, and they compromise desirable theoretical (metric) properties of SWD. Whereas a growing body of literature directly modifies the slicing distribution, an approach that often faces challenges, we revisit the classical Sliced-Wasserstein distance and propose instead to rescale the 1D Wasserstein distance so that all slices are equally informative. Importantly, we show that, with an appropriate data assumption and notion of \textit{slice informativeness}, rescaling all individual slices simplifies to \textbf{a single global scaling factor} on the SWD. This, in turn, translates to the standard learning rate search for gradient-based learning in common machine learning workflows. We perform extensive experiments across various machine learning tasks showing that the classical SWD, when properly configured, can often match or surpass the performance of more complex variants. We then answer the following question: "Is Sliced-Wasserstein all you need for common learning tasks?"
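
To make the slicing construction concrete, the following is a minimal sketch of a Monte Carlo estimate of the classical SWD between two empirical distributions. It is an illustration under stated assumptions, not the authors' implementation: the function name `sliced_wasserstein`, its parameters, and the use of NumPy are hypothetical choices, and the two samples are assumed to have the same number of points with uniform weights, so that the one-dimensional WD reduces to comparing sorted projections.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=256, p=2, seed=0):
    """Monte Carlo estimate of SW_p^p between two point clouds.

    Assumes x and y both have shape (n, d) with uniform weights
    (a simplifying assumption made for this sketch).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]

    # Draw slicing directions uniformly on the unit sphere by
    # normalizing standard Gaussian vectors.
    theta = rng.standard_normal((n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Project both samples onto every direction: shape (n, n_projections).
    # Sorting each column gives the 1D optimal coupling in closed form.
    x_proj = np.sort(x @ theta.T, axis=0)
    y_proj = np.sort(y @ theta.T, axis=0)

    # Average |difference|^p over matched points and over slices.
    return np.mean(np.abs(x_proj - y_proj) ** p)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    x = rng.standard_normal((500, 16))
    y = rng.standard_normal((500, 16)) + 0.5
    print(sliced_wasserstein(x, y))
```

If this estimate is used as a training loss, multiplying it by a constant multiplies its gradient by the same constant. Under plain gradient descent, that has the same effect as rescaling the step size, which is one way to read the abstract's link between a single global scaling factor and a learning rate search.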
