Emotion Recognition (ER), Gender Recognition (GR), and Age Estimation (AE) are paralinguistic tasks that rely not on the spoken content but primarily on speech characteristics such as pitch and tone. While previous research has made significant strides in developing models for each task individually, comparatively little attention has been paid to learning these tasks jointly, despite their inherent interconnectedness. In this demonstration, we therefore present PERSONA, an application that performs ER, GR, and AE with a single backend model. Notably, we show through a comparative study that representations from a speaker recognition pre-trained model (PTM) are better suited to this multi-task learning setup than those from the state-of-the-art (SOTA) self-supervised learning (SSL) PTM. Our methodology obviates the need to deploy a separate model for each task and can potentially save resources and time during both training and deployment.