基于mBART的文本到手语注释翻译前沿研究：以孟加拉语为例 (State-of-the-Art Translation of Text-to-Gloss using mBART : A case study of Bangla)

Despite a large deaf and dumb population of 1.7 million, Bangla Sign Language (BdSL) remains a understudied domain. Specifically, there are no works on Bangla text-to-gloss translation task. To address this gap, we begin by addressing the dataset problem. We take inspiration from grammatical rule based gloss generation used in Germany and American sign langauage (ASL) and adapt it for BdSL. We also leverage LLM to generate synthetic data and use back-translation, text generation for data augmentation. With dataset prepared, we started experimentation. We fine-tuned pretrained mBART-50 and mBERT-multiclass-uncased model on our dataset. We also trained GRU, RNN and a novel seq-to-seq model with multi-head attention. We observe significant high performance (ScareBLEU=79.53) with fine-tuning pretrained mBART-50 multilingual model from Facebook. We then explored why we observe such high performance with mBART. We soon notice an interesting property of mBART -- it was trained on shuffled and masked text data. And as we know, gloss form has shuffling property. So we hypothesize that mBART is inherently good at text-to-gloss tasks. To find support against this hypothesis, we trained mBART-50 on PHOENIX-14T benchmark and evaluated it with existing literature. Our mBART-50 finetune demonstrated State-of-the-Art performance on PHOENIX-14T benchmark, far outperforming existing models in all 6 metrics (ScareBLEU = 63.89, BLEU-1 = 55.14, BLEU-2 = 38.07, BLEU-3 = 27.13, BLEU-4 = 20.68, COMET = 0.624). Based on the results, this study proposes a new paradigm for text-to-gloss task using mBART models. Additionally, our results show that BdSL text-to-gloss task can greatly benefit from rule-based synthetic dataset.

翻译：尽管孟加拉国有170万庞大的聋哑人口，孟加拉手语（BdSL）仍是一个研究不足的领域。具体而言，目前尚无针对孟加拉语文本到手语注释翻译任务的研究。为填补这一空白，我们首先着手解决数据集问题。我们借鉴了德国手语和美国手语（ASL）中基于语法规则的手语注释生成方法，并将其适配于BdSL。同时，我们利用大语言模型生成合成数据，并采用回译和文本生成技术进行数据增强。数据集准备就绪后，我们开始实验。我们在自建数据集上对预训练的mBART-50和mBERT-multiclass-uncased模型进行微调，还训练了GRU、RNN以及一种新颖的带多头注意力机制的序列到序列模型。实验发现，对Facebook发布的预训练多语言模型mBART-50进行微调后，其性能表现显著优异（ScareBLEU=79.53）。随后我们深入探究了mBART取得如此高性能的原因。我们很快注意到mBART的一个有趣特性——该模型是在经过乱序和掩码处理的文本数据上训练的。而众所周知，手语注释形式具有乱序特性。因此我们提出假设：mBART本质上擅长文本到手语注释的翻译任务。为验证该假设，我们在PHOENIX-14T基准数据集上训练mBART-50，并与现有文献结果进行对比评估。我们的微调版mBART-50在PHOENIX-14T基准测试中展现了最先进的性能，在所有6项指标上均大幅超越现有模型（ScareBLEU = 63.89，BLEU-1 = 55.14，BLEU-2 = 38.07，BLEU-3 = 27.13，BLEU-4 = 20.68，COMET = 0.624）。基于这些结果，本研究提出了使用mBART模型进行文本到手语注释翻译任务的新范式。此外，我们的结果表明基于规则的合成数据集能极大促进BdSL文本到手语注释翻译任务的发展。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

生成性对抗网络:理论模型、评估指标和最近发展的概述，Generative Adversarial Networks (GANs): An Overview of Theoretical Model, Evaluation Metrics, and Recent Developments

专知会员服务

42+阅读 · 2020年5月30日

【跨语言BERT模型大集合】Transfer learning is increasingly going multilingual with language-specific BERT models

专知会员服务

54+阅读 · 2020年1月30日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日