This paper documents our release of the printed and audio book of the Chakavian-dialect translation of the famous novel The Little Prince as a computer-readable, AI-ready dataset, in which the textual and audio components are aligned at the level of each written and spoken word. Our motivation for this release is threefold. First, we wish to preserve this highly valuable and specific content beyond the limited editions of the printed and audio book. With the dataset published in the CLARIN.SI repository, the content is now at the fingertips of any interested individual. Second, we want to make the data available for various artificial-intelligence-related usage scenarios, such as the one we already pursue in this paper: adapting the open automatic speech recognition model Whisper-large-v3, which performs decently on standard Croatian, to Chakavian dialectal speech. We can happily report that adapting the model halves the word error rate on the selected test data, while removing up to two thirds of the error at the character level. We envision many more uses of this dataset beyond the experiments we have already performed, both in artificial intelligence research and applications and in dialectal research. Third, we hope that this now highly structured dataset will be transformed into a digital online edition of the work, allowing individuals beyond the research and technology communities to enjoy the beauty of the message of the little boy in the desert, told through the spectacular prism of the Chakavian dialect.