With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential in Chinese Classical Studies (CCS) has garnered significant attention. While existing research has primarily focused on the text and visual modalities, audio corpora in this domain remain largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It covers a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through the evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric that measures the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust multidimensional audio capabilities in CCS. MCGA Corpus: https://github.com/yxduir/MCGA