This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions, which contain not only speaking-style factor terms but also syntactic words; this disturbs the learning of speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender, pitch) and then generates a caption, ensuring that the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts the speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption via factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms training on the original captions and that, combined with GtS, it generates more diverse captions while maintaining style-prediction performance.
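The two-phase GtS decoding described above can be illustrated with a minimal toy sketch. Everything here is an assumption for illustration: the token vocabulary, the `next_token_probs` table, and the `<sep>` marker that separates the factor phrase from the caption are hypothetical stand-ins, not the paper's actual model or tokenization. The sketch only shows the control flow: greedy (argmax) decoding until the factor phrase ends, then temperature-free random sampling for the caption body.

```python
import random

def next_token_probs(context):
    # Hypothetical next-token distributions keyed by the decoded prefix.
    # A real model would compute these with a neural decoder.
    table = {
        (): {"male": 0.7, "female": 0.3},
        ("male",): {"low-pitch": 0.6, "high-pitch": 0.4},
        ("male", "low-pitch"): {"<sep>": 1.0},
    }
    # After the separator, caption tokens are drawn from a generic pool.
    default = {"a": 0.4, "man": 0.3, "speaks": 0.2, "<eos>": 0.1}
    return table.get(context, default)

def greedy_then_sampling(max_len=10, seed=0):
    rng = random.Random(seed)
    out = []
    in_factor_phrase = True  # phase 1: decode style factors deterministically
    while len(out) < max_len:
        probs = next_token_probs(tuple(out))
        if in_factor_phrase:
            tok = max(probs, key=probs.get)          # greedy: argmax token
        else:
            toks, weights = zip(*probs.items())
            tok = rng.choices(toks, weights=weights)[0]  # sampling: diversity
        out.append(tok)
        if tok == "<sep>":
            in_factor_phrase = False  # phase 2: sample the caption body
        if tok == "<eos>":
            break
    return out
```

Because the factor phrase is decoded greedily, every run yields the same style factors (semantic accuracy), while different random seeds vary only the caption wording after `<sep>` (diversity).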