Non-verbal signals in speech are encoded by prosody and carry information ranging from conversation action to attitude and emotion. Despite the importance of prosody, the principles that govern its structure are not yet adequately understood. This paper offers an analytical schema and a technological proof-of-concept for categorizing prosodic signals and associating them with meaning. The schema interprets surface representations of multi-layered prosodic events. As a first step towards implementation, we present a classification process that disentangles prosodic phenomena of three orders. It relies on fine-tuning a pre-trained speech recognition model, which enables simultaneous multi-class/multi-label detection. The process generalizes over a large variety of spontaneous data, performing on a par with, or better than, human annotation. Beyond providing a standardized formalization of prosody, disentangling prosodic patterns can guide a theory of communication and speech organization. A welcome by-product is an interpretation of prosody that will enhance speech- and language-related technologies.