Representation learning is a key research topic in machine learning, and self-supervised learning (SSL) has revolutionized computer vision. However, these approaches have not yet fully leveraged insights from biological visual processing systems. In this paper, we introduce PhiNet v2, a novel architecture that processes temporal visual input (i.e., sequences of images) without relying on strong data augmentation, enabling it to learn robust visual representations in a manner similar to human visual processing. Our learning objective is derived from variational inference. Through extensive experiments, we demonstrate that PhiNet v2 achieves performance competitive with state-of-the-art vision representation models, including RSP and CropMAE, while retaining the ability to learn effectively from sequential input without strong data augmentation. This work represents a step toward more biologically plausible computer vision systems that process visual information in a way more closely aligned with human cognition.