In this paper, we introduce the Temporal Audio Source Counting Network (TaCNet), an architecture that addresses existing limitations in audio source counting. TaCNet operates directly on raw audio input, eliminating complex preprocessing steps and simplifying the workflow. Notably, it performs well in real-time speaker counting, even with truncated input windows. Our extensive evaluation on the LibriCount dataset demonstrates TaCNet's strong performance and positions it as a state-of-the-art solution for audio source counting. With an average accuracy of 74.18% over 11 classes, TaCNet is effective across diverse scenarios, including applications involving Chinese and Persian speech. This cross-lingual adaptability highlights its versatility and potential impact.