Visual robustness and neural alignment remain critical challenges in developing artificial agents that can match biological vision systems. We present the winning approaches from Team HCMUS_TheFangs for both tracks of the NeurIPS 2025 Mouse vs. AI: Robust Visual Foraging Competition. For Track 1 (Visual Robustness), we demonstrate that architectural simplicity combined with targeted components yields superior generalization, achieving a 95.4% final score with a lightweight two-layer CNN enhanced by Gated Linear Units (GLUs) and observation normalization. For Track 2 (Neural Alignment), we develop a deep ResNet-like architecture with 16 convolutional layers and GLU-based gating that achieves top-1 neural prediction performance with 17.8 million parameters. Our systematic analysis of ten model checkpoints trained for between 60K and 1.14M steps reveals that training duration exhibits a non-monotonic relationship with performance, with optimal results achieved around 200K steps. Through comprehensive ablation studies and failure case analysis, we provide insights into why simpler architectures excel at visual robustness while deeper, higher-capacity models achieve better neural alignment. Our results challenge conventional assumptions about model complexity in visuomotor learning and offer practical guidance for developing robust, biologically inspired visual agents.
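The GLU gating and observation normalization referenced in the abstract can be illustrated with a minimal sketch. This is not the team's code; the function names, the channel-splitting axis, and the per-observation normalization scheme are illustrative assumptions about how such components are commonly implemented.

```python
import numpy as np

def glu(x, axis=-1):
    """Gated Linear Unit: split the tensor in half along `axis`;
    one half is passed through, gated by the sigmoid of the other."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

def normalize_obs(obs, eps=1e-8):
    """Hypothetical observation normalization: rescale each
    observation to zero mean and unit variance."""
    return (obs - obs.mean()) / (obs.std() + eps)
```

In a conv stack, `glu` would typically follow a convolution that doubles the channel count, so the gated output restores the original width.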