Despite the extensive literature on training loss functions, the choice of criterion used to evaluate generalization on the validation set remains underexplored. In this work, we conduct a systematic empirical and statistical study of how the validation criterion used for model selection affects test performance in neural classifiers, with particular attention to early stopping. Using fully connected networks on standard benchmarks under $k$-fold evaluation, we compare two selection strategies: (i) early stopping with patience and (ii) post-hoc selection over all epochs (i.e., no early stopping). Models are trained with cross-entropy, C-Loss, or PolyLoss; checkpoints are selected on the validation set using either accuracy or one of the three loss functions, each considered independently. Three main findings emerge. (1) Early stopping based on validation accuracy performs worst, consistently selecting checkpoints with lower test accuracy than both loss-based early stopping and post-hoc selection. (2) Loss-based validation criteria yield test accuracy that is comparable across criteria and more stable. (3) Across datasets and folds, any single validation rule often underperforms the test-optimal checkpoint. Overall, regardless of the validation criterion, the selected model typically achieves test-set performance that is statistically significantly lower than the best performance across all epochs. Our results suggest avoiding validation accuracy for parameter selection, particularly in combination with early stopping, and favoring loss-based validation criteria instead.
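The two selection strategies compared in the abstract can be illustrated with a minimal sketch. It assumes one validation score is recorded per epoch and that early stopping keeps the best checkpoint seen before patience runs out; the function names, the patience value, and the validation curves are hypothetical and invented for illustration. They are not the authors' implementation or results.

```python
from typing import Sequence


def select_post_hoc(scores: Sequence[float], higher_is_better: bool) -> int:
    """Post-hoc selection: index of the best validation score over all epochs.

    Assumes `scores` is non-empty. Ties resolve to the earliest epoch.
    """
    sign = 1.0 if higher_is_better else -1.0
    return max(range(len(scores)), key=lambda epoch: sign * scores[epoch])


def select_early_stopping(
    scores: Sequence[float], higher_is_better: bool, patience: int
) -> int:
    """Early stopping with patience: stop once `patience` consecutive epochs
    pass without a strict improvement, and return the best epoch seen so far.

    Assumes `scores` is non-empty.
    """
    sign = 1.0 if higher_is_better else -1.0
    best_epoch = 0
    best_score = sign * scores[0]
    for epoch in range(1, len(scores)):
        score = sign * scores[epoch]
        if score > best_score:
            best_score = score
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # patience exhausted; later epochs are never evaluated
    return best_epoch


# Hypothetical validation curves (made-up numbers, one value per epoch).
val_acc = [0.70, 0.74, 0.74, 0.73, 0.74, 0.78, 0.77]   # plateaus, then improves
val_loss = [0.90, 0.80, 0.75, 0.72, 0.70, 0.69, 0.71]  # decreases smoothly

acc_early = select_early_stopping(val_acc, higher_is_better=True, patience=2)
acc_post = select_post_hoc(val_acc, higher_is_better=True)
loss_early = select_early_stopping(val_loss, higher_is_better=False, patience=2)
loss_post = select_post_hoc(val_loss, higher_is_better=False)

print(acc_early, acc_post, loss_early, loss_post)
```

In this toy case, accuracy-based early stopping halts on the plateau and returns epoch 1, while post-hoc selection over the same curve returns epoch 5; on the smoother loss curve both rules return epoch 5. This only shows one mechanism by which a plateau-prone criterion can trigger a premature stop. It does not reproduce the study's statistical comparison.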