Traditional RGB-based speech generation suffers from a temporal granularity mismatch: fixed camera exposure times inevitably blur the high-frequency articulatory transients that are essential for rendering emotional speech. To overcome this limitation, we propose EventSpeech, a text-conditioned framework that pioneers the use of neuromorphic events for expressive speech generation, exploiting the fact that these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture combines a dedicated Event Encoder, which models sparse neuromorphic events, with a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK, the first benchmark comprising large-scale synthetic data and real-world recordings captured with specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines, preserving fine-grained emotions and resisting motion blur, and establishes a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.