Training artificial neural networks on mini-batches of data is now standard practice. Despite this broad usage, a quantitative theory of how large or small the optimal mini-batch size should be is still missing. This work presents a systematic attempt at understanding the role of the mini-batch size in training two-layer neural networks. Working in the teacher-student scenario with a sparse teacher, and focusing on tasks of different complexity, we quantify the effects of changing the mini-batch size $m$. We find that the generalization performance of the student often depends strongly on $m$ and may undergo a sharp phase transition at a critical value $m_c$: for $m<m_c$ the training process fails, while for $m>m_c$ the student either learns the teacher perfectly or generalizes very well. Phase transitions are induced by collective phenomena, first discovered in statistical mechanics and later observed in many fields of science. Finding a phase transition as the mini-batch size is varied raises several important questions about the role of this hyperparameter, questions that have been somewhat overlooked until now.
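To make the setting concrete, the following is a minimal sketch of a teacher-student experiment in which only the mini-batch size $m$ is varied. It is not the protocol used in this work: the $\tanh$ activation, the fixed second layer, the squared loss, the Gaussian inputs, the sparsity level, the learning rate, and all function and parameter names (`make_sparse_teacher`, `generalization_error`, `sparsity`, and so on) are illustrative assumptions.

```python
import numpy as np


def make_sparse_teacher(n_in, n_hidden, sparsity, rng):
    """Teacher first-layer weights; each entry is nonzero with probability `sparsity` (assumed)."""
    weights = rng.standard_normal((n_hidden, n_in))
    mask = rng.random((n_hidden, n_in)) < sparsity
    return weights * mask


def forward(W, X):
    """Two-layer net with a fixed averaging second layer (an assumption of this sketch)."""
    return np.tanh(X @ W.T / np.sqrt(X.shape[1])).mean(axis=1)


def generalization_error(m, n_in=20, n_hidden=3, n_train=500, n_test=1000,
                         sparsity=0.3, epochs=20, lr=0.5, seed=0):
    """Train a student with mini-batch SGD of batch size m; return its test error."""
    rng = np.random.default_rng(seed)
    W_teacher = make_sparse_teacher(n_in, n_hidden, sparsity, rng)

    # Labels come from the teacher, for both training and test inputs.
    X_train = rng.standard_normal((n_train, n_in))
    y_train = forward(W_teacher, X_train)
    X_test = rng.standard_normal((n_test, n_in))
    y_test = forward(W_teacher, X_test)

    # Student starts from small random weights.
    W = 0.1 * rng.standard_normal((n_hidden, n_in))

    for _ in range(epochs):
        perm = rng.permutation(n_train)
        for start in range(0, n_train, m):
            idx = perm[start:start + m]
            Xb, yb = X_train[idx], y_train[idx]
            act = np.tanh(Xb @ W.T / np.sqrt(n_in))   # shape (batch, n_hidden)
            err = act.mean(axis=1) - yb               # shape (batch,)
            # Gradient of 0.5 * mean(err**2) with respect to W.
            grad = (err[:, None] * (1.0 - act ** 2)).T @ Xb
            grad /= len(idx) * n_hidden * np.sqrt(n_in)
            W -= lr * grad

    return float(0.5 * np.mean((forward(W, X_test) - y_test) ** 2))
```

Calling `generalization_error(m)` over a range of values, for example `[generalization_error(m) for m in (1, 4, 16, 64, 256)]`, gives the kind of error-versus-$m$ curve on which a transition at $m_c$ would be looked for. Whether this toy setup shows one depends on the assumed choices above.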