Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.