Professional vocalists modulate the timbre or pitch of their voice to make a vocal performance more expressive. Such fluctuations are called singing techniques. Automatically detecting singing techniques in audio tracks helps in understanding how each singer shapes a performance, but the task is difficult because singing techniques vary widely. A deep neural network (DNN) model can handle this variety; however, accounting for the characteristics of the data may further improve detection performance. In this paper, we propose PrimaDNN, a convolutional recurrent neural network (CRNN) model with improvements oriented toward these characteristics. The model has two main features: 1) an input feature representation based on auxiliary pitch information and multi-resolution mel spectrograms, and 2) a convolution module based on squeeze-and-excitation (SENet) and instance normalization. In experiments on J-POP singing technique detection, PrimaDNN achieved an overall macro-F measure of 44.9%, the best result among the conventional methods compared. We also found that the contribution of each component varies depending on the type of singing technique.
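To make the headline metric concrete, the following is a minimal sketch of how a macro-F measure can be computed from per-class counts. It is not the paper's evaluation code: the technique labels and the counts are made up for illustration, and whether the paper scores frames or events is not stated in the abstract, so the sketch simply assumes each class comes with true-positive, false-positive, and false-negative counts.

```python
def f_measure(tp, fp, fn):
    """F-measure (harmonic mean of precision and recall) from raw counts."""
    denom = 2 * tp + fp + fn
    # Convention assumed here: a class with no counts at all scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f(counts):
    """Unweighted mean of the per-class F-measures.

    counts maps a class label to a (tp, fp, fn) tuple.
    Returns the macro average and the per-class scores.
    """
    per_class = {label: f_measure(*c) for label, c in counts.items()}
    return sum(per_class.values()) / len(per_class), per_class


def micro_f(counts):
    """F-measure over counts pooled across classes, shown for contrast."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f_measure(tp, fp, fn)


# Hypothetical (tp, fp, fn) counts; labels are illustrative only and
# are not claimed to match the label set used in the paper.
counts = {
    "vibrato": (80, 20, 20),
    "falsetto": (30, 30, 30),
    "scooping": (5, 15, 25),
}

overall, per_class = macro_f(counts)
pooled = micro_f(counts)
```

With these made-up counts the per-class scores are 0.8, 0.5, and 0.2, so the macro average is 0.5, whereas pooling the counts gives about 0.62. Because the macro average weights every technique equally, rare or hard techniques pull the overall score down as much as frequent ones, which is consistent with reporting per-technique contributions alongside the overall figure.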