In this paper, we study an underexplored yet important and challenging problem: counting the number of distinct sounds in raw audio characterized by a high degree of polyphonicity. We propose a novel end-to-end trainable neural network, DyDecNet, which consists of a dyadic decomposition front-end and a backbone network, and we quantify how the difficulty of counting depends on sound polyphonicity. The dyadic decomposition front-end progressively decomposes the raw waveform dyadically along the frequency axis to obtain a time-frequency representation in a multi-stage, coarse-to-fine manner. Each intermediate waveform produced by convolution with a parent filter is further processed by a pair of child filters that evenly split the parent filter's frequency response, with the higher-half child filter encoding the detail and the lower-half child filter encoding the approximation. We further introduce an energy gain normalization to normalize variance in sound loudness and spectrum overlap, and apply it to each intermediate parent waveform before feeding it to the two child filters. To better quantify the difficulty of sound counting, we design three polyphony-aware metrics: polyphony ratio, max polyphony, and mean polyphony. We evaluate DyDecNet on various datasets to demonstrate its superiority, and we further show that the dyadic decomposition network can serve as a general front-end for other acoustic tasks.
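To make the coarse-to-fine splitting concrete, the following is a minimal sketch of a dyadic decomposition, not the DyDecNet front-end itself. It rests on several assumptions that are not part of the method described here: the learnable parent and child filters are replaced by fixed, ideal FFT band masks; the energy gain normalization is replaced by a plain RMS scaling; and the names `rms_normalize`, `band_limit`, and `dyadic_decompose` are hypothetical.

```python
import numpy as np

def rms_normalize(x, eps=1e-3):
    """Scale a waveform towards unit RMS (an assumed stand-in for energy gain
    normalization). eps acts as a floor so near-silent bands are not blown up."""
    return x / (np.sqrt(np.mean(x ** 2)) + eps)

def band_limit(x, lo, hi):
    """Keep only FFT bins whose normalized frequency lies in [lo, hi)."""
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x))  # cycles per sample, from 0 to 0.5
    mask = (freqs >= lo) & (freqs < hi)
    return np.fft.irfft(spectrum * mask, n=len(x))

def dyadic_decompose(x, num_stages):
    """Split x into 2**num_stages sub-bands by halving every band at each stage."""
    # Each node is (waveform, low band edge, high band edge); the small offset
    # on the top edge keeps the Nyquist bin inside the highest band.
    nodes = [(np.asarray(x, dtype=float), 0.0, 0.5 + 1e-9)]
    for _ in range(num_stages):
        children = []
        for wave, lo, hi in nodes:
            wave = rms_normalize(wave)  # normalize the parent before splitting
            mid = 0.5 * (lo + hi)
            children.append((band_limit(wave, lo, mid), lo, mid))  # lower half: approximation
            children.append((band_limit(wave, mid, hi), mid, hi))  # upper half: detail
        nodes = children
    # Rows are ordered from the lowest to the highest frequency band.
    return np.stack([wave for wave, _, _ in nodes])
```

Calling `dyadic_decompose(x, 3)` on a waveform of length `T` returns an array of shape `(8, T)`, one row per sub-band. In a trainable front-end the ideal masks would be replaced by learned filters, so the band edges here should be read only as an illustration of the even parent-to-child split.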
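The three polyphony-aware metrics can be illustrated with a short sketch. The abstract names the metrics but does not define them, so the definitions below are assumptions: polyphony ratio is taken as the fraction of the clip duration during which at least two sounds overlap, max polyphony as the largest number of simultaneous sounds, and mean polyphony as the time-averaged number of simultaneous sounds over the portion of the clip where at least one sound is active. The function name `polyphony_metrics` and the `(onset, offset)` event format are likewise hypothetical.

```python
def polyphony_metrics(events, clip_duration):
    """Compute (polyphony ratio, max polyphony, mean polyphony) under the
    assumed definitions, from a list of (onset, offset) times in seconds."""
    # +1 at each onset, -1 at each offset. Tuple sorting places an offset
    # before an onset at the same time, so touching events do not overlap.
    boundaries = sorted(
        [(onset, 1) for onset, _ in events] + [(offset, -1) for _, offset in events]
    )
    active = 0            # number of sounds currently active
    prev_time = 0.0
    overlap_time = 0.0    # time with at least two simultaneous sounds
    sounding_time = 0.0   # time with at least one sound
    weighted_count = 0.0  # integral of the active-sound count over time
    max_polyphony = 0
    for time, delta in boundaries:
        span = time - prev_time
        if active >= 1:
            sounding_time += span
            weighted_count += active * span
        if active >= 2:
            overlap_time += span
        active += delta
        max_polyphony = max(max_polyphony, active)
        prev_time = time
    ratio = overlap_time / clip_duration
    mean_polyphony = weighted_count / sounding_time if sounding_time > 0 else 0.0
    return ratio, max_polyphony, mean_polyphony
```

For example, for three events `[(0, 4), (2, 6), (3, 5)]` in a 10-second clip, two or more sounds overlap for 3 seconds, so the sketch returns a polyphony ratio of 0.3, a max polyphony of 3, and a mean polyphony of 10/6 (about 1.67).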