One of the dominant paradigms in self-supervised learning (SSL), exemplified by MoCo and DINO, aims to produce robust representations by capturing features that are insensitive to certain image transformations, such as illumination or geometric changes. This strategy is appropriate when the objective is to recognize objects independently of their appearance. However, it becomes counterproductive as soon as appearance itself constitutes the discriminative signal. In weather analysis, for example, rain streaks, snow granularity, atmospheric scattering, reflections, and halos are not noise: they carry the essential information. In critical applications such as autonomous driving, ignoring these cues is risky, since grip and visibility depend directly on ground and atmospheric conditions. We introduce ST-STORM, a hybrid SSL framework that treats appearance (style) as a semantic modality to be disentangled from content. Our architecture explicitly separates two latent streams, regulated by gating mechanisms. The Content branch targets a stable semantic representation through a JEPA scheme coupled with a contrastive objective, promoting invariance to appearance variations. In parallel, the Style branch is constrained to capture appearance signatures (textures, contrasts, scattering) through feature prediction and reconstruction under an adversarial constraint. We evaluate ST-STORM on several tasks, including object classification (ImageNet-1K), fine-grained weather characterization, and melanoma detection (ISIC 2024 Challenge). The results show that the Style branch effectively isolates complex appearance phenomena (F1=97% on Multi-Weather and F1=94% on ISIC 2024 with 10% labeled data) without degrading the semantic performance of the Content branch (F1=80% on ImageNet-1K), and improves the preservation of critical appearance information.