TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of audio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and address them by developing impactful features. Here, we survey TorchAudio's development principles and contents and highlight key features included in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. Through empirical studies of a selection of these features, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.