Millions of people listen to podcasts, audio stories, and lectures, but editing speech remains tedious and time-consuming. Creators remove unnecessary words, cut tangential discussions, and even re-record speech to make recordings concise and engaging. Prior work automatically summarized speech by removing full sentences (extraction), but rigid extraction limits expressivity. AI tools can summarize then re-synthesize speech (abstraction), but abstraction strips the speaker's style. We present TalkLess, a system that flexibly combines extraction and abstraction to condense speech while preserving its content and style. To edit speech, TalkLess first generates possible transcript edits, selects edits to maximize compression, coverage, and audio quality, then uses a speech editing model to translate transcript edits into audio edits. TalkLess's interface provides creators control over automated edits by separating low-level wording edits (via the compression pane) from major content edits (via the outline pane). TalkLess achieves higher coverage and removes more speech errors than a state-of-the-art extractive approach. A comparison study (N=12) showed that TalkLess significantly decreased cognitive load and editing effort in speech editing. We further demonstrate TalkLess's potential in an exploratory study (N=3) where creators edited their own speech.
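The edit-selection step described above (choosing among candidate transcript edits to balance compression, coverage, and audio quality) can be illustrated with a minimal sketch. The abstract does not specify the actual objective or how coverage and audio quality are measured, so everything here is an assumption made for illustration: the `CandidateEdit` type, the word-count definition of compression, the precomputed `coverage` and `audio_quality` scores, and the weighted-sum scoring are hypothetical stand-ins, not TalkLess's method.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class CandidateEdit:
    """One possible rewrite of a transcript segment (hypothetical type)."""
    text: str             # edited transcript text for the segment
    coverage: float       # 0..1, share of original content kept (assumed precomputed)
    audio_quality: float  # 0..1, predicted quality of the audio edit (assumed precomputed)


def compression(original: str, edited: str) -> float:
    """Fraction of words removed relative to the original segment."""
    n_original = len(original.split())
    if n_original == 0:
        return 0.0
    return max(0.0, 1.0 - len(edited.split()) / n_original)


def score(original: str, cand: CandidateEdit,
          weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three criteria; the weights are illustrative only."""
    w_comp, w_cov, w_qual = weights
    return (w_comp * compression(original, cand.text)
            + w_cov * cand.coverage
            + w_qual * cand.audio_quality)


def select_edit(original: str, candidates: Sequence[CandidateEdit],
                weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> CandidateEdit:
    """Return the highest-scoring candidate; raises ValueError if none are given."""
    return max(candidates, key=lambda c: score(original, c, weights))
```

Raising the first weight favors shorter output at the expense of coverage and audio quality, which loosely mirrors the trade-off a creator would adjust when asking for more or less compression.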