The use of mini-batches of data in training artificial neural networks is nowadays very common. Despite its broad usage, a theory explaining quantitatively how large or small the optimal mini-batch size should be is still missing. This work presents a systematic attempt at understanding the role of the mini-batch size in training two-layer neural networks. Working in the teacher-student scenario, with a sparse teacher, and focusing on tasks of different complexity, we quantify the effects of changing the mini-batch size $m$. We find that the generalization performance of the student often depends strongly on $m$ and may undergo a sharp phase transition at a critical value $m_c$: for $m<m_c$ the training process fails, while for $m>m_c$ the student learns the teacher perfectly or generalizes very well. Phase transitions are induced by collective phenomena first discovered in statistical mechanics and later observed in many fields of science. Finding a phase transition as the mini-batch size varies raises several important questions about the role of a hyperparameter that has been somewhat overlooked until now.
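To make the setting concrete, the following is a minimal sketch of a teacher-student experiment in which only the mini-batch size $m$ is varied. The abstract does not specify the architecture, activation, loss, or form of sparsity, so everything here is an illustrative assumption rather than the setup used in this work: Gaussian inputs, a single-layer sign teacher with a few nonzero $\pm 1$ weights, a two-layer tanh student trained by plain mini-batch SGD on the squared error, and the hypothetical helper names `make_sparse_teacher`, `train_student`, `generalization_error`, and `sweep_batch_sizes`.

```python
import numpy as np


def make_sparse_teacher(n_inputs, n_active, rng):
    """Teacher weights with only n_active nonzero entries (assumed form of sparsity)."""
    w = np.zeros(n_inputs)
    idx = rng.choice(n_inputs, size=n_active, replace=False)
    w[idx] = rng.choice([-1.0, 1.0], size=n_active)
    return w


def teacher_labels(X, w_teacher):
    """Binary labels produced by the teacher (assumed sign output)."""
    return np.sign(X @ w_teacher)


def train_student(X, y, m, n_hidden=8, lr=0.05, epochs=20, seed=0):
    """Train a two-layer tanh student with mini-batch SGD; m is the mini-batch size."""
    rng = np.random.default_rng(seed)
    n_samples, n_inputs = X.shape
    W = rng.normal(scale=1.0 / np.sqrt(n_inputs), size=(n_hidden, n_inputs))
    a = rng.normal(scale=1.0 / np.sqrt(n_hidden), size=n_hidden)
    for _ in range(epochs):
        perm = rng.permutation(n_samples)
        for start in range(0, n_samples, m):
            batch = perm[start:start + m]
            Xb, yb = X[batch], y[batch]
            H = np.tanh(Xb @ W.T)                      # hidden activations, (b, K)
            err = H @ a - yb                           # residuals, (b,)
            grad_a = H.T @ err / len(batch)            # gradient of 0.5 * mean(err**2)
            grad_pre = np.outer(err, a) * (1.0 - H**2) # backprop through tanh
            grad_W = grad_pre.T @ Xb / len(batch)
            a -= lr * grad_a
            W -= lr * grad_W
    return W, a


def generalization_error(W, a, w_teacher, n_test=2000, seed=1):
    """Fraction of fresh Gaussian inputs on which student and teacher disagree."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_test, w_teacher.shape[0]))
    pred = np.sign(np.tanh(X @ W.T) @ a)
    return float(np.mean(pred != teacher_labels(X, w_teacher)))


def sweep_batch_sizes(batch_sizes, n_inputs=20, n_active=3, n_samples=200, seed=0):
    """Return {m: generalization error} for a fixed teacher and training set."""
    rng = np.random.default_rng(seed)
    w_teacher = make_sparse_teacher(n_inputs, n_active, rng)
    X = rng.normal(size=(n_samples, n_inputs))
    y = teacher_labels(X, w_teacher)
    results = {}
    for m in batch_sizes:
        W, a = train_student(X, y, m)
        results[m] = generalization_error(W, a, w_teacher)
    return results
```

Calling `sweep_batch_sizes([1, 4, 16, 64])` returns one generalization error per value of $m$. Whether such a toy model shows a sharp transition at some $m_c$ depends on the task, the learning rate, and the training budget, none of which are fixed by the abstract, so the sketch only illustrates how the dependence on $m$ could be measured.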