A narrated e-book combines synchronized audio with digital text, highlighting the currently spoken word or sentence during playback. This format supports early literacy and assists individuals with reading challenges, while also letting general readers switch seamlessly between reading and listening. With the emergence of natural-sounding neural Text-to-Speech (TTS) technology, several commercial services now use this technology to convert standard text e-books into high-quality narrated e-books. However, no open-source solution currently exists for this task. In this paper, we present Calliope, an open-source framework designed to fill this gap. Our method uses state-of-the-art open-source TTS to convert a text e-book into a narrated e-book in the EPUB 3 Media Overlay format. The method has three distinguishing features: audio timestamps are captured directly during TTS, ensuring exact synchronization between narration and text highlighting; the publisher's original typography, styling, and embedded media are strictly preserved; and the entire pipeline operates offline. Offline operation eliminates recurring API costs, mitigates privacy concerns, and avoids the copyright compliance issues associated with cloud-based services. The framework currently supports the open-source TTS systems XTTS-v2 and Chatterbox. An alternative approach is to generate the narration with TTS first and then synchronize it with the text using forced alignment. However, whereas our method ensures exact synchronization, our experiments show that forced alignment introduces drift between the audio and the text highlighting that is large enough to degrade the reading experience. Source code and usage instructions are available at https://github.com/hugohammer/TTS-Narrated-Ebook-Creator.git.
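To illustrate how timestamps captured at synthesis time can map onto an EPUB 3 Media Overlay, the following minimal sketch builds a SMIL document from per-sentence audio durations. It is not taken from the Calliope source code: the function name `build_smil`, the file names, and the fragment ids are hypothetical, and it assumes that each sentence is synthesized separately, that its duration is recorded, and that the clips are concatenated in reading order, so that clip boundaries are known by construction rather than estimated afterwards.

```python
from xml.sax.saxutils import quoteattr


def build_smil(text_href, audio_href, segments):
    """Return a Media Overlay (SMIL) document as a string.

    text_href  -- hypothetical path of the XHTML content document
    audio_href -- hypothetical path of the concatenated narration file
    segments   -- list of (fragment_id, duration_seconds) in reading order,
                  where each duration is assumed to have been recorded when
                  the sentence was synthesized
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">',
        '  <body>',
    ]
    start = 0.0
    for index, (fragment_id, duration) in enumerate(segments, 1):
        # Each clip begins exactly where the previous one ended.
        end = start + duration
        text_src = quoteattr(text_href + "#" + fragment_id)
        lines.append(f'    <par id="par{index}">')
        lines.append(f'      <text src={text_src}/>')
        lines.append(
            f'      <audio src={quoteattr(audio_href)} '
            f'clipBegin="{start:.3f}s" clipEnd="{end:.3f}s"/>'
        )
        lines.append('    </par>')
        start = end
    lines += ['  </body>', '</smil>']
    return "\n".join(lines)


if __name__ == "__main__":
    # Two sentences with durations assumed to come from the TTS step.
    print(build_smil("chapter1.xhtml", "audio/chapter1.mp3",
                     [("s1", 1.5), ("s2", 2.25)]))
```

Each `par` element pairs one text fragment with one audio clip, which is what a reading system uses to highlight the sentence being spoken. Because the boundaries are accumulated from known durations, no alignment step is involved in this sketch.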