Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text - 专知论文

会员服务 ·

0

机器翻译 · Machine Translation · Performer · 特化 · 计算机科学 ·

2023 年 2 月 26 日

Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text

翻译：探索代码混合型埃及阿拉伯语-英语神经机器翻译的分词方法研究

Marwa Gaser,Manuel Mager,Injy Hamed,Nizar Habash,Slim Abdennadher,Ngoc Thang Vu

from arxiv, Accepted to EACL 2023

Data sparsity is one of the main challenges posed by code-switching (CS), which is further exacerbated in the case of morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effectiveness of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform the best in segmentation tasks but under-perform in MT. Nevertheless, we find that the choice of the segmentation setup to use for MT is highly dependent on the data size. For extreme low-resource scenarios, a combination of frequency and morphology-based segmentations is shown to perform the best. For more resourced settings, such a combination does not bring significant improvements over the use of frequency-based segmentation.

翻译：数据稀疏是代码混合（CS）面临的主要挑战之一，对于形态丰富的语言而言这一挑战更加严峻。在机器翻译（MT）任务中，形态学分词已被证明能有效缓解单语场景下的数据稀疏问题，但尚未针对代码混合环境进行探究。本文研究了不同分词方法对机器翻译性能的影响，涵盖基于形态学和基于频率的分词技术。我们在代码混合型阿拉伯语-英语到英语的翻译任务上进行实验，通过详细分析考察了数据规模、不同代码混合程度的句子等多种条件。实验结果表明，形态感知分词器在分词任务中表现最佳，但在机器翻译中效果欠佳。然而研究发现，机器翻译所采用的分词方案高度依赖数据规模。在极低资源场景下，基于频率与形态学的混合分词方案表现最佳；而在资源较充足的场景中，这种混合方案相比单纯使用频率分词并未带来显著提升。

0

相关内容

机器翻译

机器翻译，又称为自动翻译，是利用计算机将一种自然语言(源语言)转换为另一种自然语言(目标语言)的过程。它是计算语言学的一个分支，是人工智能的终极目标之一，具有重要的科学研究价值。

知识荟萃

精品入门和进阶教程、论文和代码整理等

更多

查看相关VIP内容、论文、资讯等

NeurlPS 2022 | 自然语言处理相关论文分类整理

NeurlPS 2022 | 自然语言处理相关论文分类整理

专知会员服务

51+阅读 · 2022年10月2日

ICLR 2022杰出论文公布：7篇论文获得，清华朱军课题组摘得

ICLR 2022杰出论文公布：7篇论文获得，清华朱军课题组摘得

专知会员服务

60+阅读 · 2022年4月22日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

Multi-Task Learning的几篇综述文章

Multi-Task Learning的几篇综述文章

深度学习自然语言处理

15+阅读 · 2020年6月15日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

机器学习研究会

11+阅读 · 2017年12月5日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

20克级的水溶性Mn-Cu-In-S磁/光双功能量子点的制备

国家自然科学基金

0+阅读 · 2015年12月31日

Iotrochota属海绵及其内生菌中含氮化合物及其抗肿瘤活性研究

国家自然科学基金

0+阅读 · 2013年12月31日

HOXB-AS3/HOXB7/PAK4信号轴调控结直肠癌侵袭转移的分子机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

SOX9在VKH综合征黑色素脱失中的作用及分子机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

两个WD40转录因子对银杏类黄酮生物合成调控的研究

国家自然科学基金

0+阅读 · 2012年12月31日

13q染色体末端先天性心脏病致病基因的鉴定及功能研究

国家自然科学基金

0+阅读 · 2012年12月31日

LRRC4-AP2/SP1-miR182-LRRC4和LRRC4-miR-185-DNMT1-LRRC4调控环路在脑胶质瘤中相互调控的机制研究

国家自然科学基金

0+阅读 · 2011年12月31日

神经元凋亡时Egr1对BH3-only蛋白Bim的转录调控

国家自然科学基金

0+阅读 · 2009年12月31日

转录因子对丹酚酸类化合物生物合成调控研究

国家自然科学基金

0+阅读 · 2009年12月31日

生物催化氧化还原途径中的遗传多样性研究

国家自然科学基金

0+阅读 · 2009年12月31日

A Survey of Learning on Small Data

Arxiv

19+阅读 · 2022年7月29日

Sequence Level Contrastive Learning for Text Summarization

Sequence Level Contrastive Learning for Text Summarization

Arxiv

14+阅读 · 2021年9月24日

Learning Neural Models for Natural Language Processing in the Face of Distributional Shift

Arxiv

11+阅读 · 2021年9月3日

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

Arxiv

30+阅读 · 2021年7月28日

Contrastive learning of global and local features for medical image segmentation with limited annotations

Arxiv

19+阅读 · 2020年6月18日

A Survey of Methods for Low-Power Deep Learning and Computer Vision

A Survey of Methods for Low-Power Deep Learning and Computer Vision

Arxiv

14+阅读 · 2020年3月24日

Image Segmentation Using Deep Learning: A Survey

Image Segmentation Using Deep Learning: A Survey

Arxiv

47+阅读 · 2020年1月15日

A Survey of Model Compression and Acceleration for Deep Neural Networks

Arxiv

67+阅读 · 2019年9月8日

Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling

Arxiv

11+阅读 · 2018年6月16日

Matching Networks for One Shot Learning

Arxiv

10+阅读 · 2017年12月29日

VIP会员

文章信息

相关主题

Machine Translation

计算机科学

最新内容

【伯克利博士论文】基于动作分块策略的强化学习

【伯克利博士论文】基于动作分块策略的强化学习

专知会员服务

1+阅读 · 46分钟前

Transformer增强强化学习：通信网络基础与应用综述

Transformer增强强化学习：通信网络基础与应用综述

专知会员服务

1+阅读 · 49分钟前

ICML 2026 | SARDI：扩散语言模型的自增强检索

ICML 2026 | SARDI：扩散语言模型的自增强检索

专知会员服务

5+阅读 · 6月6日

长时程具身智能安全综述：机器人操作的跨层分析

长时程具身智能安全综述：机器人操作的跨层分析

专知会员服务

7+阅读 · 6月6日

从“杀伤链”到“杀伤网”：新时代防空反导体系的真正需求

从“杀伤链”到“杀伤网”：新时代防空反导体系的真正需求

专知会员服务

12+阅读 · 6月6日

《锻造军官能力：军官发展的军事训练、学术教育及设计思维导向创新的多维度研究》最新300页

《锻造军官能力：军官发展的军事训练、学术教育及设计思维导向创新的多维度研究》最新300页

专知会员服务

7+阅读 · 6月6日

《国防领域安全采用大语言模型的战略蓝图》

《国防领域安全采用大语言模型的战略蓝图》

专知会员服务

9+阅读 · 6月6日

《对抗性电磁环境下远程巡飞弹作战的保密指挥控制数据链》

《对抗性电磁环境下远程巡飞弹作战的保密指挥控制数据链》

专知会员服务

9+阅读 · 6月6日

CVPR2026奖项公布，谷歌D4RT最佳论文获奖，何恺明ResNet、YOLO获时间检验奖！

CVPR2026奖项公布，谷歌D4RT最佳论文获奖，何恺明ResNet、YOLO获时间检验奖！

专知会员服务

7+阅读 · 6月6日

ICML 2026 | 演化选择的因果建模

ICML 2026 | 演化选择的因果建模

专知会员服务

9+阅读 · 6月5日

综述｜学习式3D表征最新进展与趋势

综述｜学习式3D表征最新进展与趋势

专知会员服务

7+阅读 · 6月5日

《武器作战效能分析：基于虚拟构造仿真大数据与深度学习的初步见解》

《武器作战效能分析：基于虚拟构造仿真大数据与深度学习的初步见解》

专知会员服务

10+阅读 · 6月5日

《自主巡飞弹药系统量子逻辑框架：一种基于不确定模糊集的方法》

《自主巡飞弹药系统量子逻辑框架：一种基于不确定模糊集的方法》

专知会员服务

7+阅读 · 6月5日

人工智能重塑威慑：算法优势的兴起

人工智能重塑威慑：算法优势的兴起

专知会员服务

9+阅读 · 6月5日

【博士论文】基于物理结构与贝叶斯不确定性的可靠神经网络

【博士论文】基于物理结构与贝叶斯不确定性的可靠神经网络

专知会员服务

14+阅读 · 6月4日

相关VIP内容

NeurlPS 2022 | 自然语言处理相关论文分类整理

NeurlPS 2022 | 自然语言处理相关论文分类整理

专知会员服务

51+阅读 · 2022年10月2日

ICLR 2022杰出论文公布：7篇论文获得，清华朱军课题组摘得

ICLR 2022杰出论文公布：7篇论文获得，清华朱军课题组摘得

专知会员服务

60+阅读 · 2022年4月22日

NLP必读经典文献100篇

专知会员服务

124+阅读 · 2020年9月8日

史上最全！358篇机器学习&自然语言处理综述论文！都这儿了

专知会员服务

129+阅读 · 2020年7月18日

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

100+篇《自监督学习(Self-Supervised Learning)》论文最新合集

专知会员服务

167+阅读 · 2020年3月18日

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

Aspect-Oriented Syntax Network for Aspect-Based Sentiment Analysis，中山大学数据科学与计算机学院权小军教授，第八届全国社会媒体处理大会SMP2019

专知会员服务

19+阅读 · 2019年10月22日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

[综述]深度学习下的场景文本检测与识别

[综述]深度学习下的场景文本检测与识别

专知会员服务

78+阅读 · 2019年10月10日

机器学习入门的经验与建议

机器学习入门的经验与建议

专知会员服务

94+阅读 · 2019年10月10日

热门VIP内容

开通专知VIP会员享更多权益服务

Transformer增强强化学习：通信网络基础与应用综述

长时程具身智能安全综述：机器人操作的跨层分析

【伯克利博士论文】基于动作分块策略的强化学习

ICML 2026 | SARDI：扩散语言模型的自增强检索

相关资讯

Multi-Task Learning的几篇综述文章

Multi-Task Learning的几篇综述文章

深度学习自然语言处理

15+阅读 · 2020年6月15日

BERT/Transformer/迁移学习NLP资源大列表

BERT/Transformer/迁移学习NLP资源大列表

专知

19+阅读 · 2019年6月9日

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

BERT/注意力机制/Transformer/迁移学习NLP资源大列表：awesome-bert-nlp

AINLP

40+阅读 · 2019年6月9日

Hierarchically Structured Meta-learning

Hierarchically Structured Meta-learning

CreateAMind

27+阅读 · 2019年5月22日

Transferring Knowledge across Learning Processes

Transferring Knowledge across Learning Processes

CreateAMind

29+阅读 · 2019年5月18日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

44+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

【推荐】(TensorFlow)SSD实时手部检测与追踪（附代码）

机器学习研究会

11+阅读 · 2017年12月5日

【论文】变分推断（Variational inference)的总结

【论文】变分推断（Variational inference)的总结

机器学习研究会

39+阅读 · 2017年11月16日

Capsule Networks解析

Capsule Networks解析

机器学习研究会

11+阅读 · 2017年11月12日

相关论文

A Survey of Learning on Small Data

Arxiv

19+阅读 · 2022年7月29日

Sequence Level Contrastive Learning for Text Summarization

Sequence Level Contrastive Learning for Text Summarization

Arxiv

14+阅读 · 2021年9月24日

Learning Neural Models for Natural Language Processing in the Face of Distributional Shift

Arxiv

11+阅读 · 2021年9月3日

Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing

Arxiv

30+阅读 · 2021年7月28日

Contrastive learning of global and local features for medical image segmentation with limited annotations

Arxiv

19+阅读 · 2020年6月18日

A Survey of Methods for Low-Power Deep Learning and Computer Vision

A Survey of Methods for Low-Power Deep Learning and Computer Vision

Arxiv

14+阅读 · 2020年3月24日

Image Segmentation Using Deep Learning: A Survey

Image Segmentation Using Deep Learning: A Survey

Arxiv

47+阅读 · 2020年1月15日

A Survey of Model Compression and Acceleration for Deep Neural Networks

Arxiv

67+阅读 · 2019年9月8日

Multimodal Sentiment Analysis using Hierarchical Fusion with Context Modeling

Arxiv

11+阅读 · 2018年6月16日

Matching Networks for One Shot Learning

Arxiv

10+阅读 · 2017年12月29日

相关基金

20克级的水溶性Mn-Cu-In-S磁/光双功能量子点的制备

国家自然科学基金

0+阅读 · 2015年12月31日

Iotrochota属海绵及其内生菌中含氮化合物及其抗肿瘤活性研究

国家自然科学基金

0+阅读 · 2013年12月31日

HOXB-AS3/HOXB7/PAK4信号轴调控结直肠癌侵袭转移的分子机制研究

国家自然科学基金

0+阅读 · 2013年12月31日

SOX9在VKH综合征黑色素脱失中的作用及分子机制研究

国家自然科学基金

0+阅读 · 2012年12月31日

两个WD40转录因子对银杏类黄酮生物合成调控的研究

国家自然科学基金

0+阅读 · 2012年12月31日

13q染色体末端先天性心脏病致病基因的鉴定及功能研究

国家自然科学基金

0+阅读 · 2012年12月31日

LRRC4-AP2/SP1-miR182-LRRC4和LRRC4-miR-185-DNMT1-LRRC4调控环路在脑胶质瘤中相互调控的机制研究

国家自然科学基金

0+阅读 · 2011年12月31日

神经元凋亡时Egr1对BH3-only蛋白Bim的转录调控

国家自然科学基金

0+阅读 · 2009年12月31日

转录因子对丹酚酸类化合物生物合成调控研究

国家自然科学基金

0+阅读 · 2009年12月31日

生物催化氧化还原途径中的遗传多样性研究

国家自然科学基金

0+阅读 · 2009年12月31日

微信扫码咨询专知VIP会员