Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.