Hidden-unit BERT (HuBERT) is a widely used self-supervised learning (SSL) model in speech processing. However, we argue that its fixed 20 ms resolution for hidden representations may not be optimal across speech-processing tasks, since the attributes those tasks depend on (e.g., speaker characteristics and semantics) unfold over different time scales. To address this limitation, we propose using HuBERT representations at multiple resolutions for downstream tasks. We explore two approaches, parallel and hierarchical, for integrating HuBERT features of different resolutions. Experiments show that HuBERT with multiple resolutions outperforms the original model. This highlights the potential of multiple resolutions in SSL models such as HuBERT for capturing diverse information from speech signals.
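As a rough illustration of the parallel idea, the sketch below aligns feature streams of different frame shifts to the finest one and concatenates them frame by frame. The abstract does not specify how the fusion is done, so everything here is an assumption made for illustration: upsampling by frame repetition, concatenation as the fusion step, the 40 ms and 100 ms resolutions, and the helper names `upsample_by_repetition` and `parallel_fuse`. Plain Python lists stand in for real HuBERT features.

```python
from typing import Dict, List

Frames = List[List[float]]  # one feature vector per frame


def upsample_by_repetition(frames: Frames, factor: int) -> Frames:
    """Repeat each frame `factor` times so a coarse stream matches a finer frame rate."""
    return [list(frame) for frame in frames for _ in range(factor)]


def parallel_fuse(streams: Dict[int, Frames]) -> Frames:
    """Fuse streams keyed by frame shift in ms (hypothetical layout).

    Assumption: every frame shift is an integer multiple of the finest one,
    and fusion is plain concatenation along the feature dimension.
    """
    finest_ms = min(streams)
    aligned = []
    for shift_ms in sorted(streams):
        if shift_ms % finest_ms != 0:
            raise ValueError(f"{shift_ms} ms is not a multiple of {finest_ms} ms")
        aligned.append(upsample_by_repetition(streams[shift_ms], shift_ms // finest_ms))

    # Streams can differ by a few frames at the edges; trim to the shortest.
    n_frames = min(len(stream) for stream in aligned)

    fused = []
    for t in range(n_frames):
        vector: List[float] = []
        for stream in aligned:
            vector.extend(stream[t])
        fused.append(vector)
    return fused


if __name__ == "__main__":
    # Stand-in features for a 200 ms utterance, 2 dimensions per stream.
    streams = {
        20: [[0.1, 0.2] for _ in range(10)],   # 20 ms frames (HuBERT's native resolution)
        40: [[0.3, 0.4] for _ in range(5)],    # 40 ms frames (hypothetical)
        100: [[0.5, 0.6] for _ in range(2)],   # 100 ms frames (hypothetical)
    }
    fused = parallel_fuse(streams)
    print(len(fused), len(fused[0]))  # 10 frames, 6 dimensions each
```

Repetition keeps the sketch dependency-free; a learned upsampling layer or a weighted sum in place of concatenation would be equally plausible readings of "parallel" integration.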