Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
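For concreteness, below is a minimal PyTorch sketch of the objective the abstract describes: element-wise binary cross-entropy against the teacher's binary messages, plus a coding-rate regularizer on the student's bit probabilities. The thresholding rule, the MCR²-style form of the rate term, and all names (`discrete_agreement_loss`, `lam`, `eps`) are assumptions for illustration, not the paper's definitive formulation.

```python
import torch
import torch.nn.functional as F

def discrete_agreement_loss(student_logits, teacher_logits, lam=0.1, eps=0.5):
    """Hypothetical sketch: discrete agreement + coding-rate regularization.

    student_logits, teacher_logits: (B, D) projection-head outputs, where D
    is the capacity (in bits) of the fixed binary channel.
    """
    # The teacher emits a multi-label binary message; thresholding at 0 is an
    # assumed quantization rule, and no gradient flows through the teacher.
    with torch.no_grad():
        message = (teacher_logits > 0).float()  # (B, D) binary codes

    # Element-wise binary cross-entropy enforces discrete agreement: each bit
    # of the student's prediction is matched against the teacher's bit.
    bce = F.binary_cross_entropy_with_logits(student_logits, message)

    # Coding-rate term (assumed MCR^2-style form): maximizing
    # R = 1/2 logdet(I + D / (B * eps^2) * Z^T Z) over batch-centered bit
    # probabilities Z encourages the batch to spread across the channel
    # rather than collapsing onto a few bits.
    p = torch.sigmoid(student_logits)
    z = p - p.mean(dim=0, keepdim=True)  # center over the batch
    B, D = z.shape
    gram = z.T @ z  # (D, D)
    rate = 0.5 * torch.logdet(
        torch.eye(D, device=z.device) + (D / (B * eps ** 2)) * gram
    )

    # Minimize disagreement while maximizing channel utilization.
    return bce - lam * rate
```

In this reading, the student learns to reproduce the teacher's D-bit message exactly, while the rate term penalizes collapsed or redundant bits. The periodic projection-head reinitialization mentioned above would then amount to resetting the head's parameters every fixed number of steps (e.g. calling `reset_parameters()` on its linear layers), forcing the backbone embedding to stay predictive under fresh discrete encodings; the teacher would typically be an EMA copy of the student, as in standard teacher-student SSL, though the abstract does not specify this.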