Self-supervised learning (SSL) faces a fundamental conflict between semantic understanding and image reconstruction. High-level semantic SSL (e.g., DINO) relies on global tokens that are forced to be location-invariant for augmentation alignment, a process that inherently discards the spatial coordinates required for reconstruction. Conversely, generative SSL (e.g., MAE) preserves dense feature grids for reconstruction but fails to produce high-level abstractions. We introduce STELLAR, a framework that resolves this tension by factorizing visual features into a low-rank product of semantic concepts and their spatial distributions. This disentanglement allows us to perform DINO-style augmentation alignment on the semantic tokens while maintaining the precise spatial mapping in the localization matrix necessary for pixel-level reconstruction. We demonstrate that as few as 16 sparse tokens under this factorized form are sufficient to simultaneously support high-quality reconstruction (2.60 FID) and match the semantic performance of dense backbones (79.10% ImageNet accuracy). Our results highlight STELLAR as a versatile sparse representation that bridges the gap between discriminative and generative vision by strategically separating semantic identity from spatial geometry. Code available at https://aka.ms/stellar.
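The core factorization can be illustrated with a minimal numerical sketch. This is not the authors' implementation; all variable names are hypothetical, and truncated SVD stands in here as one concrete way to obtain a rank-16 product of semantic tokens and their spatial distributions from a dense feature grid.

```python
import numpy as np

# Hypothetical sketch of the factorized representation described above:
# a dense feature grid F (HW x D) is approximated by the low-rank product
# of a localization matrix A (HW x K) and K semantic tokens S (K x D).
H, W, D, K = 14, 14, 768, 16          # K = 16 sparse tokens, as in the abstract
rng = np.random.default_rng(0)
F = rng.standard_normal((H * W, D))   # stand-in for a dense backbone's features

# Truncated SVD gives one concrete rank-K factorization F ~= A @ S.
U, s, Vt = np.linalg.svd(F, full_matrices=False)
A = U[:, :K] * s[:K]                  # spatial distribution of each concept
S = Vt[:K]                            # K semantic tokens (location-free)

F_hat = A @ S                         # dense grid reconstructed from 16 tokens
print(F_hat.shape)
```

Under this split, augmentation alignment can act on `S` alone (the tokens carry no coordinates), while `A` retains the per-location mapping needed for reconstruction.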