Learning representative embeddings for different speaking-style attributes, such as emotion, age, and gender, is critical for both recognition tasks (e.g., cognitive computing and human-computer interaction) and generative tasks (e.g., style-controllable speech generation). In this work, we introduce ParaMETA, a unified and flexible framework for learning and controlling speaking styles directly from speech. Unlike existing methods that rely on single-task models or cross-modal alignment, ParaMETA learns disentangled, task-specific embeddings by projecting speech into a dedicated subspace for each style attribute. This design reduces inter-task interference, mitigates negative transfer, and allows a single model to handle multiple paralinguistic tasks, such as emotion, gender, age, and language classification. Beyond recognition, ParaMETA enables fine-grained style control in text-to-speech (TTS) generative models: it supports both speech- and text-based prompting and allows users to modify one speaking style while preserving the others. Extensive experiments demonstrate that ParaMETA outperforms strong baselines in classification accuracy and generates more natural and expressive speech, while remaining lightweight and efficient enough for real-world applications.
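To make the dedicated-subspace idea concrete, the following is a minimal PyTorch sketch: a shared utterance-level speech embedding is projected into an independent low-dimensional subspace per paralinguistic attribute, each with its own classifier. All module names, dimensions, and attribute/class counts here are illustrative assumptions; the abstract does not specify ParaMETA's actual architecture.

```python
# Minimal sketch of the dedicated-subspace idea from the abstract.
# NOTE: all names, dimensions, and class counts below are illustrative
# assumptions; this is not the paper's actual ParaMETA architecture.
import torch
import torch.nn as nn

class SubspaceHeads(nn.Module):
    """Project a shared speech embedding into one dedicated subspace per
    paralinguistic attribute, then classify within each subspace."""

    def __init__(self, enc_dim, sub_dim, num_classes):
        super().__init__()
        # One projection per attribute -> disentangled, task-specific embeddings.
        self.proj = nn.ModuleDict(
            {task: nn.Linear(enc_dim, sub_dim) for task in num_classes}
        )
        # Independent classifiers keep the tasks from interfering with each other.
        self.cls = nn.ModuleDict(
            {task: nn.Linear(sub_dim, n) for task, n in num_classes.items()}
        )

    def forward(self, h):
        # h: (batch, enc_dim) utterance-level embedding from a speech encoder.
        embeddings = {t: proj(h) for t, proj in self.proj.items()}
        logits = {t: self.cls[t](e) for t, e in embeddings.items()}
        return embeddings, logits

# Usage: one forward pass yields all attribute predictions at once.
heads = SubspaceHeads(
    enc_dim=768, sub_dim=128,
    num_classes={"emotion": 7, "gender": 2, "age": 4, "language": 10},
)
h = torch.randn(8, 768)            # stand-in for a speech-encoder output
emb, logits = heads(h)
print(logits["emotion"].shape)     # torch.Size([8, 7])
```

Because each attribute lives in its own subspace, a downstream generative model could in principle swap the embedding for one attribute while leaving the others untouched, which is the mechanism behind the single-style editing described above.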