We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt on mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, so data is scarce; (ii) a performance gap exists between state-of-the-art models, whose memory footprints reach hundreds of MB, and methods better suited to resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train it on widely available audio speech datasets. Users then fine-tune the model with a small amount of bone conduction data. TRAMBA outperforms state-of-the-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order of magnitude smaller memory footprint and an inference speedup of up to 465 times. We integrate TRAMBA into real systems and show that it (i) improves the battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher-quality voice in noisy environments than over-the-air speech; and (iii) requires a memory footprint of less than 20.0 MB.