Calliope: A TTS-based Narrated E-book Creator Ensuring Exact Synchronization, Privacy, and Layout Fidelity

A narrated e-book combines synchronized audio with digital text, highlighting the currently spoken word or sentence during playback. This format supports early literacy and assists individuals with reading challenges, while also allowing general readers to switch seamlessly between reading and listening. With the emergence of natural-sounding neural Text-to-Speech (TTS) technology, several commercial services have been developed that use this technology to convert standard text e-books into high-quality narrated e-books. However, no open-source solution currently exists for this task. In this paper, we present Calliope, an open-source framework designed to fill this gap. Our method uses state-of-the-art open-source TTS to convert a text e-book into a narrated e-book in the EPUB 3 Media Overlay format. The method introduces several innovations: audio timestamps are captured directly during TTS, ensuring exact synchronization between narration and text highlighting; the publisher's original typography, styling, and embedded media are strictly preserved; and the entire pipeline operates offline. Offline operation eliminates recurring API costs, mitigates privacy concerns, and avoids the copyright compliance issues associated with cloud-based services. The framework currently supports the open-source TTS systems XTTS-v2 and Chatterbox. An alternative approach is to first generate the narration via TTS and then synchronize it with the text using forced alignment. However, whereas our method ensures exact synchronization, our experiments show that forced alignment introduces drift between the audio and the text highlighting that is large enough to degrade the reading experience. Source code and usage instructions are available at https://github.com/hugohammer/TTS-Narrated-Ebook-Creator.git.
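To illustrate the idea of capturing timestamps during synthesis, the following is a minimal sketch and not Calliope's actual implementation. It assumes that each sentence is synthesized separately, that the number of audio samples produced for each sentence is known, and that the clips are concatenated in reading order into one audio file. The function name `build_media_overlay`, the element ids `s1` and `s2`, the file names, and the 24 kHz sample rate are all hypothetical. Under these assumptions, the clip boundaries follow from a running sample count, so no alignment step is needed.

```python
from xml.sax.saxutils import quoteattr


def build_media_overlay(text_href, audio_href, segments, sample_rate):
    """Return an EPUB 3 Media Overlay (SMIL) document as a string.

    segments: list of (element_id, num_samples) pairs in reading order,
    where num_samples is the length of the audio synthesized for the
    XHTML element with that id (hypothetical input format).
    """
    def clock(samples):
        # SMIL clock value in seconds, rounded to milliseconds.
        return f"{samples / sample_rate:.3f}s"

    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<smil xmlns="http://www.w3.org/ns/SMIL" '
        'xmlns:epub="http://www.idpf.org/2007/ops" version="3.0">',
        '  <body>',
        f'    <seq id="seq1" epub:textref={quoteattr(text_href)}>',
    ]
    cursor = 0  # running sample offset into the concatenated audio file
    for index, (element_id, num_samples) in enumerate(segments, start=1):
        begin, end = cursor, cursor + num_samples
        text_src = quoteattr(f"{text_href}#{element_id}")
        lines.append(f'      <par id="par{index}">')
        lines.append(f'        <text src={text_src}/>')
        lines.append(
            f'        <audio src={quoteattr(audio_href)} '
            f'clipBegin="{clock(begin)}" clipEnd="{clock(end)}"/>'
        )
        lines.append('      </par>')
        cursor = end  # the next clip starts exactly where this one ends
    lines += ['    </seq>', '  </body>', '</smil>']
    return "\n".join(lines)


# Two sentences of 1.0 s and 1.5 s at an assumed 24 kHz sample rate.
smil = build_media_overlay(
    "chapter1.xhtml",
    "audio/chapter1.wav",
    [("s1", 24000), ("s2", 36000)],
    sample_rate=24000,
)
print(smil)
```

Keeping the offset as an integer sample count, and converting to seconds only when writing `clipBegin` and `clipEnd`, avoids accumulating floating-point error over a long chapter.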

