Automatic speech recognition (ASR) systems play a key role in applications involving human-machine interaction. Despite their importance, the ASR models proposed for Portuguese over the last decade have been limited in their ability to correctly identify punctuation marks in automatic transcriptions, which hinders the use of those transcriptions by other systems, models, and even humans. Recently, however, OpenAI proposed Whisper ASR, a general-purpose speech recognition model that has raised high expectations for addressing these limitations. This chapter presents the first study of Whisper's performance on punctuation prediction for Portuguese. We present an experimental evaluation that considers both theoretical aspects, involving pausing points (comma) and complete ideas (exclamation mark, question mark, and full stop), and practical aspects, involving transcript-based topic modeling, an application that depends on punctuation marks to perform well. We analyze experimental results on videos from the Museum of the Person, a virtual museum that aims to tell and preserve people's life histories, and discuss the pros and cons of Whisper in a real-world scenario. Although our experiments indicate that Whisper achieves state-of-the-art results, we conclude that its handling of some punctuation marks, such as the exclamation mark, semicolon, and colon, still requires improvement.