Streaming speaker diarization is crucial for time-critical medical dispatch, but deploying it on resource-constrained hardware requires smaller, faster models. Using SIMSAMU, a dataset of simulated medical-dispatch conversations, we first evaluate streaming behavior and then compress the segmentation model with pruning and low-bit quantization. We characterize performance across a range of streaming latency budgets and find that additional buffering is not consistently beneficial, while very low-latency operating points can substantially degrade performance. Model compression trades diarization accuracy for memory footprint: at one highlighted operating point, FP16 halves the model size with an essentially unchanged real-time factor, at the cost of a 40\% relative increase in diarization error rate (DER) over the baseline. This work characterizes the trade-offs for real-time deployment and contributes to speech technology that enables reliable human communication in time-critical contexts.
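The three quantities the abstract reports (relative DER increase, real-time factor, and model size under reduced precision) can be made concrete with a minimal sketch. The function names and every numeric value below are hypothetical illustrations, not figures or code from the paper; the sketch assumes the usual definitions of relative change and of real-time factor as processing time divided by audio duration.

```python
def relative_der_increase(der_baseline: float, der_compressed: float) -> float:
    """Relative DER change of a compressed model against the baseline.

    A return value of 0.40 corresponds to a 40% relative increase.
    """
    if der_baseline <= 0:
        raise ValueError("baseline DER must be positive")
    return (der_compressed - der_baseline) / der_baseline


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def weight_bytes(num_parameters: int, bits_per_weight: int) -> int:
    """Storage needed for the weights alone, ignoring file-format overhead."""
    return num_parameters * bits_per_weight // 8


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate the arithmetic.
    rel = relative_der_increase(der_baseline=0.10, der_compressed=0.14)
    rtf = real_time_factor(processing_seconds=3.0, audio_seconds=60.0)
    fp32 = weight_bytes(num_parameters=1_000_000, bits_per_weight=32)
    fp16 = weight_bytes(num_parameters=1_000_000, bits_per_weight=16)
    print(f"relative DER increase: {rel:.0%}")
    print(f"real-time factor: {rtf:.3f}")
    print(f"FP16 / FP32 weight size: {fp16 / fp32:.2f}")
```

Because the increase is relative, the same 40\% corresponds to a different absolute DER gap depending on the baseline, so the relative figure should be read alongside the baseline DER.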