Oral to Web: Digitizing 'Zero Resource'Languages of Bangladesh

We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers. The complete dataset is publicly accessible through the Multilingual Cloud platform (multiling.cloud), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.

翻译：我们介绍了"多语言云语料库"，这是孟加拉国首个国家级、平行、多模态的少数民族及原住民语言数据集。尽管孟加拉国拥有约40种分属四个语系的少数民族语言，其中14种被列为濒危语言，但这些主要依靠口传、在计算语言学上属于"零资源"的语言变体，此前一直缺乏系统性的跨语系数字语料库。本语料库包含85792条结构化文本条目，每条均包含孟加拉语刺激文本、英语翻译及国际音标转写，并附有约107小时的转录音频，覆盖了藏缅语系、印欧语系、南亚语系和达罗毗荼语系的42种语言变体，以及两种谱系未分类语言。数据通过为期90天、覆盖孟加拉国九个地区的系统性田野调查收集完成，共涉及16名数据采集员、77名发音人和43名验证者，采用包含2224个独立项目的预定义诱发模板，该模板按三个语言粒度层级组织：孤立词汇项（涵盖22个语义域的475个单词）、语法结构（涵盖21个类别包括动词变位范式的887个句子）以及引导性话语（涵盖46个对话场景的862个提示）。田野调查后处理包括由10名语言学家完成的国际音标注音，并由6名评审员进行独立裁定。完整数据集通过多语言云平台（multiling.cloud）公开提供，支持对所有已记录语言变体的标注音频和文本数据进行检索。本文详细阐述了语料库设计、田野调查方法、数据集结构及各语言覆盖情况，并探讨了其对濒危语言记录、低资源自然语言处理以及语言多样化发展中国家的数字保存工作的意义。