Differentiable Architecture Search (DARTS) is a simple yet efficient Neural Architecture Search (NAS) method. During the search stage, DARTS trains a supernet by jointly optimizing architecture parameters and network parameters. During the evaluation stage, DARTS discretizes the supernet and derives the optimal architecture from the learned architecture parameters. However, recent research has shown that, during training, the supernet tends to converge to sharp minima rather than flat minima. This is evidenced by the higher sharpness of the supernet's loss landscape, and it ultimately leads to a performance gap between the supernet and the optimal architecture, referred to here as the discretization gap. In this paper, we propose Self-Distillation Differentiable Neural Architecture Search (SD-DARTS) to alleviate this gap. We use self-distillation to transfer knowledge from previous steps of the supernet to guide its training in the current step, which reduces the sharpness of the supernet's loss and narrows the performance gap between the supernet and the optimal architecture. Furthermore, we introduce the concept of voting teachers: multiple previous supernets are selected as teachers, and their output probabilities are aggregated through voting to obtain the final teacher prediction. Experimental results on real datasets demonstrate the advantages of our self-distillation-based NAS method over state-of-the-art alternatives.
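The voting-teacher idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "voting" means a simple average of the teachers' output probabilities, and that the training objective is a cross-entropy term plus a weighted KL term toward the aggregated teacher. The function names and the `lam` and `temperature` parameters are hypothetical, introduced only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def vote(teacher_logits, temperature=1.0):
    """Aggregate several teacher snapshots into one prediction.

    Assumption: voting is modeled as an unweighted average of the
    teachers' output probabilities.
    """
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(probs, label, eps=1e-12):
    """Cross-entropy of predicted probabilities against an integer label."""
    return -math.log(probs[label] + eps)


def sd_loss(student_logits, teacher_logits, label, lam=1.0, temperature=1.0):
    """Hypothetical self-distillation loss for one example.

    student_logits: output of the current supernet.
    teacher_logits: outputs of earlier supernet snapshots (the teachers).
    lam: assumed weight of the distillation term.
    """
    hard_term = cross_entropy(softmax(student_logits), label)
    teacher = vote(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    return hard_term + lam * kl_divergence(teacher, soft_student)


if __name__ == "__main__":
    # Two earlier snapshots act as teachers for the current step.
    previous_outputs = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2]]
    current_output = [1.0, 1.2, 0.3]
    print(sd_loss(current_output, previous_outputs, label=0, lam=0.5))
```

In this sketch the distillation term vanishes when the current supernet already agrees with the aggregated teachers, so the extra penalty only acts when the current step drifts away from earlier predictions. Other aggregation rules, such as weighted or majority voting, would fit the same interface.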