This paper proposes using both audio input and subject information to predict a listener's personalized preference between two audio segments that have the same content but different quality. A siamese network compares the two inputs and predicts the preference. Several structures for each branch of the siamese network are investigated. Among them, an LDNet with PANNs' CNN6 as the encoder and a multi-layer perceptron block as the decoder achieves the largest improvement over a baseline model that uses only audio input, raising overall accuracy from 77.56% to 78.04% (a gain of 0.48 percentage points). Experimental results also show that using all of the subject information, including age, gender, and the specifications of headphones or earphones, is more effective than using only part of it.
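The siamese comparison described in the abstract can be illustrated with a minimal sketch. It is not the paper's model: the LDNet and CNN6 encoder are replaced by a toy two-layer perceptron over precomputed feature vectors, and the way the two branch outputs are combined (a sigmoid over the score difference) is an assumption, since the abstract does not specify it. The names `make_mlp`, `score`, and `prefer_first`, and the numeric encoding of the subject information, are hypothetical.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, seed=0):
    """Create random weights for a toy two-layer perceptron (assumed stand-in
    for the encoder/decoder stack; the paper uses LDNet with CNN6)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
    return w1, w2


def score(params, audio_feat, subject_feat):
    """One branch: concatenate audio and subject features, return a scalar score."""
    w1, w2 = params
    x = list(audio_feat) + list(subject_feat)
    # Hidden layer with ReLU activation.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden))


def prefer_first(params, audio_a, audio_b, subject_feat):
    """Probability that the listener prefers segment A over segment B.

    Both segments go through the same weights (the siamese property);
    combining the branches by a score difference is an assumption.
    """
    diff = score(params, audio_a, subject_feat) - score(params, audio_b, subject_feat)
    return 1.0 / (1.0 + math.exp(-diff))


# Hypothetical inputs: 3-dimensional audio features per segment, and subject
# information encoded as (normalized age, gender flag, headphone-type flag).
params = make_mlp(in_dim=6, hidden_dim=4, seed=0)
audio_a = [0.9, 0.1, 0.4]
audio_b = [0.2, 0.7, 0.3]
subject = [0.35, 1.0, 0.0]
p_ab = prefer_first(params, audio_a, audio_b, subject)
p_ba = prefer_first(params, audio_b, audio_a, subject)
```

Because both segments share one set of weights, swapping the inputs mirrors the prediction (`p_ab + p_ba` equals 1 up to rounding), and identical segments give a probability of 0.5. An audio-only baseline in this sketch would correspond to passing an empty `subject_feat` and reducing `in_dim` accordingly.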