Automatic music transcription (AMT) aims to convert raw audio into a symbolic music representation. A fundamental problem in music information retrieval (MIR), AMT is considered difficult even for trained human experts because multiple harmonics overlap in the acoustic signal. Speech recognition, one of the most popular tasks in natural language processing, aims to translate spoken human language into text. Because AMT and speech recognition are similar in nature (both translate an audio signal into a symbolic encoding), this paper investigates whether a generic neural network architecture can work on both tasks. We introduce a new neural network architecture built on top of the current state-of-the-art Onsets and Frames model and compare the performance of several of its variations on the AMT task. We also test the architecture on the speech recognition task. For AMT, our models produced better results than the model trained with the state-of-the-art architecture; however, although a similar architecture could be trained on the speech recognition task, its results fell short of those of task-specific models.