Mići Princ -- A Little Boy Teaching Speech Technologies the Chakavian Dialect

This paper documents our efforts to release the printed book and the audiobook of the Chakavian-dialect translation of the famous novel The Little Prince as a computer-readable, AI-ready dataset, with the textual and audio components of the two releases now aligned at the level of each written and spoken word. Our motivation for this release is threefold. First, we wish to preserve this highly valuable and specific content beyond the limited editions of the printed book and the audiobook. With the dataset published in the CLARIN.SI repository, this content is now at the fingertips of any interested individual. Second, we want to make the data available for various artificial-intelligence-related usage scenarios, such as the one we already pursue in this paper: adapting the open automatic speech recognition model Whisper-large-v3, which performs decently on standard Croatian, to Chakavian dialectal speech. We are happy to report that adapting the model reduces the word error rate on the selected test data by half, while removing up to two thirds of the error at the character level. We envision many more uses of this dataset beyond the experiments we have already performed, both in artificial intelligence research and applications and in dialectal research. Third, we hope that this now highly structured dataset will be transformed into a digital online edition of the work, allowing individuals beyond the research and technology communities to enjoy the beauty of the message of the little boy in the desert, told through the spectacular prism of the Chakavian dialect.
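
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how word error rate and character error rate are commonly computed: the edit distance between a reference transcript and a model hypothesis, divided by the length of the reference. It is not the evaluation code used in the paper. The function names (`edit_distance`, `wer`, `cer`) are illustrative assumptions, and the sketch assumes whitespace tokenization with no text normalization, which real evaluations usually apply first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings of characters)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Toy placeholder strings, not taken from the dataset.
    print(wer("a b c d", "a x c"))
    print(cer("abcd", "abxd"))
```

Under these definitions, halving the word error rate means that the adapted model needs half as many word-level edits per reference word as the original model on the same test data.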

