Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

We present Franca (pronounced Fran-ka, meaning "free one"), the first fully open-source (data, code, weights) vision foundation model that matches, and in many cases surpasses, the performance of state-of-the-art proprietary models such as DINOv2, CLIP, and SigLIPv2. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses only publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond the model release, we tackle critical limitations of clustering methods in self-supervised learning (SSL). Modern models assign image features to large codebooks via clustering algorithms such as Sinkhorn-Knopp, but they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters without increasing the model size, improving both performance and memory efficiency. Additionally, we propose a novel positional disentanglement strategy that explicitly removes positional biases from dense representations, thereby improving the encoding of semantic content. This leads to consistent gains on several downstream benchmarks, demonstrating the utility of cleaner feature spaces. Our contributions establish a new standard for transparent, high-performance vision models and open a path toward more reproducible and generalizable foundation models for the broader AI community. The code and model checkpoints are available at https://github.com/valeoai/Franca.
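As a rough illustration of the nested clustering idea, the NumPy sketch below clusters progressively larger prefixes of a feature vector against progressively larger codebooks, balancing each assignment with a Sinkhorn-Knopp normalization. It is not the Franca implementation: the function names (`sinkhorn_knopp`, `matryoshka_assignments`), the prefix sizes, the codebook sizes, the temperature `eps`, and the iteration count are all assumptions chosen for readability, and the released code may organize the heads differently.

```python
import numpy as np


def sinkhorn_knopp(scores, n_iters=3, eps=0.05):
    """Map a (samples, clusters) score matrix to soft, roughly balanced assignments.

    Hypothetical helper: eps and n_iters are illustrative values.
    """
    # Subtract the global max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / eps).T  # shape: (clusters, samples)
    q /= q.sum()
    n_clusters, n_samples = q.shape
    for _ in range(n_iters):
        # Row step: every cluster receives the same total mass.
        q /= q.sum(axis=1, keepdims=True)
        q /= n_clusters
        # Column step: every sample carries the same total mass.
        q /= q.sum(axis=0, keepdims=True)
        q /= n_samples
    # Rescale so that each sample's assignment sums to 1.
    return (q * n_samples).T


def matryoshka_assignments(features, dims, codebooks):
    """Cluster nested prefixes features[:, :d] with one codebook per prefix.

    Assumption: coarse prefixes use small codebooks, full features use the largest.
    """
    assignments = []
    for d, prototypes in zip(dims, codebooks):
        prefix = features[:, :d]
        # Cosine similarity between normalized prefixes and normalized prototypes.
        prefix = prefix / np.linalg.norm(prefix, axis=1, keepdims=True)
        protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
        assignments.append(sinkhorn_knopp(prefix @ protos.T))
    return assignments


# Toy usage with random data: 8 samples, 16-dimensional features.
rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))
dims = [4, 8, 16]            # nested prefix sizes (illustrative)
n_clusters = [2, 4, 8]       # coarse-to-fine codebook sizes (illustrative)
codebooks = [rng.normal(size=(k, d)) for k, d in zip(n_clusters, dims)]
soft_assignments = matryoshka_assignments(features, dims, codebooks)
```

Because every level reads a prefix of the same feature vector, the backbone output does not grow with the number of levels; only the per-level codebooks are added. In this sketch the codebooks are random, whereas in a real training loop they would be learned parameters.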
