We present our latest findings on backchannel modeling, motivated by the canonical use of the minimal responses "Yeah" and "Uh-huh" in English and their corresponding tokens in German, and by the effect of encoding the speaker-listener interaction. Backchanneling theories emphasize the active and continuous role of the listener over the course of a conversation, the listener's effect on the speaker's subsequent talk, and the resulting dynamic speaker-listener interaction. We therefore propose a neural acoustic backchannel classifier for minimal responses that processes acoustic features from the speaker's speech, captures and imitates listeners' backchanneling behavior, and encodes the speaker-listener interaction. Our experimental results on the Switchboard and GECO datasets show that, in almost all tested scenarios, speaker or listener behavior embeddings help the model make more accurate backchannel predictions. More importantly, a suitable interaction encoding strategy, namely combining the speaker and listener embeddings, yields the best F1-score on both datasets.