Deep learning models are widely adopted across many domains, but their performance relies heavily on large amounts of data. Datasets often contain many irrelevant or redundant samples, which makes training computationally inefficient. In this work, we introduce k-means clustering as a method for efficient data pruning, for the first time in the audio domain. K-means clustering groups similar samples together, which allows the dataset to be reduced in size while preserving its representative characteristics. As an example, we perform a clustering analysis on a keyword spotting (KWS) dataset. We discuss how k-means clustering can significantly reduce the size of audio datasets while maintaining classification performance across neural networks (NNs) with different architectures. We further comment on the role of scaling analysis in identifying optimal pruning strategies when the number of samples is large. Our study serves as a proof of principle, demonstrating the potential of data selection with distance-based clustering algorithms in the audio domain and highlighting promising research avenues.
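To make the idea of clustering-based pruning concrete, the following is a minimal sketch, not the procedure used in this work. It assumes each audio sample has already been mapped to a fixed-length feature vector, and it assumes one particular selection rule (keep the samples closest to each centroid). The function names `kmeans` and `prune`, the parameter `keep_fraction`, and the selection rule itself are all hypothetical choices made for illustration.

```python
import math
import random


def _assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length feature vectors."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct samples of the dataset.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = _assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment so the labels match the returned centroids.
    return centroids, _assign(points, centroids)


def prune(points, k, keep_fraction, seed=0):
    """Return the sorted indices of the samples kept after pruning."""
    centroids, labels = kmeans(points, k, seed=seed)
    kept = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        # Assumed rule (hypothetical): keep the members nearest the centroid.
        idx.sort(key=lambda i: math.dist(points[i], centroids[c]))
        kept.extend(idx[: math.ceil(keep_fraction * len(idx))])
    return sorted(kept)
```

Calling `prune(features, k=10, keep_fraction=0.5)` on a list of feature vectors would return the indices of roughly half the samples, drawn from every cluster. Keeping the samples farthest from the centroids, or sampling at random within each cluster, are equally plausible rules under the same framing.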