With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential in Chinese Classical Studies (CCS) has attracted significant attention. While existing research focuses primarily on the text and visual modalities, audio corpora in this domain remain largely underexplored. To bridge this gap, we introduce the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA), a 119-hour corpus comprising 22,000 audio samples. It covers a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Our evaluation of ten MLLMs shows that current models still face substantial challenges on the MCGA test set. Furthermore, we introduce a domain-specific metric for SEC and a metric that measures the consistency between a model's speech and text capabilities. We publicly release MCGA to facilitate the development of more robust MLLMs. MCGA Corpus: https://github.com/yxduir/MCGA