Millions of people listen to podcasts, audio stories, and lectures, but editing speech remains tedious and time-consuming. Creators remove unnecessary words, cut tangential discussions, and even re-record speech to make recordings concise and engaging. Prior work automatically summarized speech by removing full sentences (extraction), but rigid extraction limits expressivity. AI tools can summarize then re-synthesize speech (abstraction), but abstraction strips the speaker's style. We present TalkLess, a system that flexibly combines extraction and abstraction to condense speech while preserving its content and style. To edit speech, TalkLess first generates possible transcript edits, selects edits to maximize compression, coverage, and audio quality, then uses a speech editing model to translate transcript edits into audio edits. TalkLess's interface provides creators control over automated edits by separating low-level wording edits (via the compression pane) from major content edits (via the outline pane). TalkLess achieves higher coverage and removes more speech errors than a state-of-the-art extractive approach. A comparison study (N=12) showed that TalkLess significantly decreased cognitive load and editing effort in speech editing. We further demonstrate TalkLess's potential in an exploratory study (N=3) where creators edited their own speech.
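The selection step described above (generate candidate transcript edits, then pick the combination that balances compression, coverage, and audio quality) can be illustrated with a minimal sketch. This is not TalkLess's actual algorithm: the `Segment` and `Candidate` types, the importance values, the splice-count proxy for audio quality, and the weights are all hypothetical, and the exhaustive search is only practical for a handful of segments.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Segment:
    text: str
    importance: float  # hypothetical: how much key content the segment carries


@dataclass(frozen=True)
class Candidate:
    text: str               # edited transcript text for one segment ("" = removed)
    importance_kept: float  # hypothetical: importance retained by this edit
    splices: int            # hypothetical audio-quality proxy: number of audio cuts


def score(segments, combo, w_comp=1.0, w_cov=1.0, w_splice=0.05):
    """Weighted sum of compression and coverage, minus a splice penalty."""
    original_words = sum(len(s.text.split()) for s in segments)
    kept_words = sum(len(c.text.split()) for c in combo)
    compression = 1 - kept_words / original_words
    coverage = sum(c.importance_kept for c in combo) / sum(s.importance for s in segments)
    return w_comp * compression + w_cov * coverage - w_splice * sum(c.splices for c in combo)


def select_edits(segments, candidates_per_segment):
    """Try every combination of one candidate per segment; keep the best-scoring one."""
    return list(max(product(*candidates_per_segment), key=lambda combo: score(segments, combo)))


segments = [
    Segment("So, um, today we will talk about editing speech recordings.", 1.0),
    Segment("By the way, my cat is asleep.", 0.1),
]
candidates = [
    [
        Candidate(segments[0].text, 1.0, 0),                                    # keep as is
        Candidate("Today we will talk about editing speech recordings.", 1.0, 1),  # extractive trim
        Candidate("Today: editing speech.", 0.6, 1),                            # abstractive rewrite
    ],
    [
        Candidate(segments[1].text, 0.1, 0),  # keep the tangent
        Candidate("", 0.0, 1),                # remove the tangent
    ],
]

chosen = select_edits(segments, candidates)
for c in chosen:
    print(repr(c.text))
```

With these illustrative weights, the sketch trims the filler words from the first segment and removes the tangent entirely, mixing a light extractive edit with a full removal. Raising `w_cov` or `w_splice` would push the choice toward more conservative edits.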