Temporally Aligning Long Audio Interviews with Questions: A Case Study in Multimodal Data Integration

The problem of audio-to-text alignment has seen a significant amount of research using complete supervision during training. However, this research typically does not address long audio recordings in which the text being queried does not appear verbatim in the audio file. This work is a collaboration with CARE India, a non-governmental organization that collects long audio health surveys from young mothers residing in rural parts of Bihar, India. Given a question drawn from the questionnaire used to guide these surveys, we aim to locate where the question is asked within a long audio recording. This is of great value to African and Asian organizations that would otherwise have to painstakingly go through long and noisy audio recordings to locate questions (and answers) of interest. Our proposed framework, INDENT, uses a cross-attention-based model and prior information on the temporal ordering of sentences to learn speech embeddings that capture the semantics of the underlying spoken text. These learnt embeddings are used to retrieve the corresponding audio segment based on text queries at inference time. We empirically demonstrate that our model outperforms text-based heuristics, improving R-avg by about 3%. We also show that noisy ASR transcripts, generated using state-of-the-art ASR models for Indian languages, yield better results when used in place of speech. INDENT, trained only on Hindi data, is able to cater to all languages supported by the (semantically) shared text space. We illustrate this empirically on 11 Indic languages.
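
The inference-time retrieval step described above can be illustrated with a minimal sketch. It assumes that the text query and each audio segment have already been mapped to vectors in a shared space, and that segments are ranked by cosine similarity. The function name `retrieve_segments`, the toy 3-dimensional embeddings, and the choice of cosine similarity are all illustrative assumptions and are not taken from INDENT itself.

```python
import numpy as np


def retrieve_segments(query_emb, segment_embs, k=3):
    """Rank audio segments by cosine similarity to a text query.

    Hypothetical helper for illustration only; the similarity function
    actually used by INDENT is not specified here.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    s = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    scores = s @ q
    # Indices of the k highest-scoring segments, best first.
    top = np.argsort(-scores)[:k]
    return top, scores[top]


# Toy stand-ins for learnt embeddings: one row per audio segment.
segment_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
# Toy stand-in for the embedding of a questionnaire question.
query_emb = np.array([0.9, 0.1, 0.0])

top_idx, top_scores = retrieve_segments(query_emb, segment_embs, k=2)
print(top_idx, top_scores)
```

In this toy setting, the segment whose embedding points most nearly in the same direction as the query is ranked first. In practice the segment embeddings would come from the trained speech encoder, and the query embedding from the shared text space.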
