We propose StyleCap, a method for generating natural language descriptions of the speaking styles that appear in speech. Most conventional techniques for para-/non-linguistic information recognition focus on classifying pre-defined labels or estimating their intensity, and therefore cannot explain the reasoning behind a recognition result in an interpretable manner. As a first step towards an end-to-end method for generating speaking-style prompts from speech, i.e., automatic speaking-style captioning, StyleCap uses paired data of speech and natural language descriptions to train neural networks that map a speech representation vector to prefix vectors, which are then fed into a large language model (LLM)-based text decoder. We explore which text decoders and speech feature representations are suitable for this new task. The experimental results show that StyleCap improves both the accuracy and the diversity of the generated speaking-style captions when it uses more powerful LLMs as the text decoder, speech self-supervised learning (SSL) features, and sentence rephrasing augmentation. Samples of speaking-style captions generated by StyleCap are publicly available.
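The prefix mechanism described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the mapping network, so a single linear layer stands in for it, the dimensions are toy values, and the names `speech_to_prefix` and `build_decoder_input` are hypothetical. The sketch only shows how one speech representation vector could be turned into a short sequence of prefix vectors and placed in front of the caption token embeddings that a text decoder consumes.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return a randomly initialised (out_dim x in_dim) weight matrix and a zero bias."""
    weights = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, weights, bias):
    """Apply y = W x + b to a single vector x."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def speech_to_prefix(speech_vec, weights, bias, prefix_len, embed_dim):
    """Map one speech vector to prefix_len vectors of size embed_dim.

    Assumption: a single linear layer stands in for the mapping network.
    """
    flat = linear(speech_vec, weights, bias)
    return [flat[i * embed_dim:(i + 1) * embed_dim] for i in range(prefix_len)]


def build_decoder_input(prefix, caption_embeddings):
    """Prepend the prefix vectors to the caption token embeddings."""
    return prefix + caption_embeddings


# Hypothetical toy sizes; real SSL features and LLM embeddings are far larger.
SPEECH_DIM, PREFIX_LEN, EMBED_DIM = 8, 4, 6

rng = random.Random(0)
weights, bias = make_linear(SPEECH_DIM, PREFIX_LEN * EMBED_DIM, rng)

# Stand-ins for an utterance-level speech feature and three caption token embeddings.
speech_vec = [rng.random() for _ in range(SPEECH_DIM)]
caption_embeddings = [[rng.random() for _ in range(EMBED_DIM)] for _ in range(3)]

prefix = speech_to_prefix(speech_vec, weights, bias, PREFIX_LEN, EMBED_DIM)
decoder_input = build_decoder_input(prefix, caption_embeddings)

print(len(prefix), len(decoder_input), len(decoder_input[0]))  # prints: 4 7 6
```

In a trained system the mapping weights would be learned from the paired speech and caption data, and the decoder would be conditioned on the prefix when predicting caption tokens; neither step is shown here.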