We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt on mobile and wearable platforms for two reasons: (i) data collection is labor-intensive, resulting in data scarcity; (ii) there is a performance gap between state-of-the-art models, whose memory footprints reach hundreds of MBs, and methods better suited to resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train it on widely available audio speech datasets; users then fine-tune it with a small amount of bone conduction data. TRAMBA outperforms state-of-the-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order of magnitude smaller memory footprint and an inference speedup of up to 465 times. We integrate TRAMBA into real systems and show that it (i) improves the battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher-quality voice in noisy environments than over-the-air speech; and (iii) requires a memory footprint of less than 20.0 MB.