Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
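To make the objective described in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the teacher message is obtained by thresholding teacher logits at zero, that the coding-rate term takes the common log-determinant form R(Z) = 1/2 log det(I + d/(n eps^2) Z^T Z), and that the two terms are combined with a weight `lam`; none of these details are specified in the abstract. All function names and the parameters `eps` and `lam` are hypothetical, and the periodic reinitialization of the projection head is omitted.

```python
import math


def binarize(teacher_logits):
    """Teacher message: one bit per channel dimension (assumed: threshold at 0)."""
    return [[1.0 if v > 0 else 0.0 for v in row] for row in teacher_logits]


def bce_with_logits(student_logits, bits):
    """Element-wise binary cross-entropy, averaged over samples and bits."""
    total, count = 0.0, 0
    for s_row, b_row in zip(student_logits, bits):
        for s, b in zip(s_row, b_row):
            # Numerically stable form: max(s, 0) - s*b + log(1 + exp(-|s|))
            total += max(s, 0.0) - s * b + math.log1p(math.exp(-abs(s)))
            count += 1
    return total / count


def logdet_spd(a):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return 2.0 * sum(math.log(low[i][i]) for i in range(n))


def coding_rate(z, eps=0.5):
    """Assumed form: 0.5 * logdet(I + d / (n * eps^2) * Z^T Z) for an n x d batch Z."""
    n, d = len(z), len(z[0])
    scale = d / (n * eps * eps)
    gram = [
        [(1.0 if i == j else 0.0) + scale * sum(row[i] * row[j] for row in z)
         for j in range(d)]
        for i in range(d)
    ]
    return 0.5 * logdet_spd(gram)


def communication_loss(student_logits, teacher_logits, lam=0.1):
    """Hypothetical combination: discrete agreement minus weighted coding rate."""
    bits = binarize(teacher_logits)
    agreement = bce_with_logits(student_logits, bits)
    # Soft teacher codes in [-1, 1]; tanh(v / 2) equals 2 * sigmoid(v) - 1.
    soft = [[math.tanh(v / 2.0) for v in row] for row in teacher_logits]
    return agreement - lam * coding_rate(soft)


if __name__ == "__main__":
    teacher = [[2.0, -1.5, 0.3], [-0.7, 1.1, -2.2]]
    student = [[1.8, -1.0, 0.1], [-0.5, 0.9, -1.9]]
    print(communication_loss(student, teacher))
```

In this sketch the binary cross-entropy term rewards the student for predicting each teacher bit, while subtracting the coding rate penalizes batches whose codes collapse onto a few directions, which is one way to encourage use of the full channel.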