Soft random sampling (SRS) is a simple yet effective approach to efficiently training large-scale deep neural networks on massive data. In each epoch, SRS selects a subset uniformly at random with replacement from the full data set. In this paper, we conduct a theoretical and empirical analysis of SRS. First, we analyze its sampling dynamics, including data coverage and occupancy. Next, we investigate its convergence with non-convex objective functions and give the convergence rate. Finally, we analyze its generalization performance. We empirically evaluate SRS for image recognition on CIFAR10 and for automatic speech recognition on Librispeech and an in-house payload dataset to demonstrate its effectiveness. Compared to existing coreset-based data selection methods, SRS offers a better accuracy-efficiency trade-off. On real-world industrial-scale data sets in particular, it is shown to be a powerful training strategy, delivering significant speedup and competitive performance at almost no additional computing cost.
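To make the sampling step and the notion of data coverage concrete, the following is a minimal sketch, not the authors' implementation. It assumes the per-epoch subset size is a fixed fraction of the data set size; the parameter name `ratio`, the function names, and the values of `n` and `ratio` are illustrative assumptions that do not come from the abstract. It relies only on the standard fact that `m` uniform draws with replacement from `n` items hit an expected fraction `1 - (1 - 1/n)^m` of distinct items.

```python
import numpy as np


def srs_epoch_indices(n, ratio, rng):
    """Draw round(ratio * n) indices uniformly at random WITH replacement.

    `ratio` is a hypothetical name for the per-epoch selection fraction.
    """
    m = int(round(ratio * n))
    return rng.integers(0, n, size=m)


def expected_coverage(n, m):
    """Expected fraction of distinct items hit by m uniform draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** m


# Illustrative values only (assumptions, not taken from the paper).
n, ratio = 50_000, 0.5
rng = np.random.default_rng(0)

idx = srs_epoch_indices(n, ratio, rng)

# Empirical coverage: fraction of the full data set seen at least once this epoch.
empirical = np.unique(idx).size / n
expected = expected_coverage(n, idx.size)
print(f"empirical coverage: {empirical:.4f}, expected: {expected:.4f}")
```

Because draws are made with replacement, some items appear more than once within an epoch, so the coverage is strictly smaller than `ratio`. For large `n` the expected coverage is close to `1 - exp(-ratio)`. A fresh subset is drawn every epoch, so the fraction of data seen at least once grows over the course of training.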