Because humans can comprehend audio and video played faster than their original speed, we often listen to or watch such content at higher playback speeds to save time. To exploit this ability, systems have been developed that automatically adjust playback speed according to the user's condition and the type of content, supporting more efficient comprehension of time-series content. However, these systems could extend human speed-listening ability further by generating speech whose playback speed is optimized over even finer time units. In this study, we examine whether humans can understand such optimized speech, and we propose a system that automatically adjusts playback speed at units as small as phonemes while preserving speech intelligibility. The system uses a speech recognizer's score as a proxy for how well a human can hear a given unit of speech and raises the playback speed of that unit as far as it remains intelligible. This method can produce fast but intelligible speech. In an evaluation experiment, we used a blind test to compare speech played back at a constant fast speed with the flexibly sped-up speech generated by the proposed method, and confirmed that the proposed method produced speech that was easier to listen to.
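The core idea, choosing for each small unit the fastest speed at which a recognizer still scores it well, can be illustrated with a minimal sketch. The abstract does not specify the optimization procedure, so everything here is an assumption made for illustration: the names `pick_speed`, `plan_speeds`, and `toy_score`, the candidate speed grid, the threshold of 0.8, and the toy scoring table that stands in for a real speech recognizer.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a recognizer: the highest speed at which
# each toy unit is still "recognized". Not taken from the paper.
TOY_LIMIT = {"a": 3.0, "s": 1.5, "t": 2.0}


def toy_score(unit: str, speed: float) -> float:
    """Return 1.0 if the toy unit survives this speed, else 0.0."""
    return 1.0 if speed <= TOY_LIMIT.get(unit, 0.0) else 0.0


def pick_speed(
    unit: str,
    score: Callable[[str, float], float],
    speeds: Sequence[float],
    threshold: float,
) -> float:
    """Pick the fastest candidate speed whose score meets the threshold.

    Falls back to the slowest candidate when no speed qualifies.
    """
    acceptable = [s for s in speeds if score(unit, s) >= threshold]
    return max(acceptable) if acceptable else min(speeds)


def plan_speeds(
    units: Sequence[str],
    score: Callable[[str, float], float],
    speeds: Sequence[float] = (1.0, 1.5, 2.0, 2.5, 3.0),
    threshold: float = 0.8,  # assumed value, not from the paper
) -> List[float]:
    """Assign one playback speed to each unit independently."""
    return [pick_speed(u, score, speeds, threshold) for u in units]


if __name__ == "__main__":
    print(plan_speeds(["a", "s", "t"], toy_score))
```

In a real system, `score` would wrap a speech recognizer applied to the time-stretched audio of each unit, and neighbouring speeds would probably need smoothing to avoid abrupt changes. Both are outside the scope of this sketch.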