Pinyin-to-Character (P2C) conversion is the core task of the Input Method Engine (IME) in commercial input software for Asian languages such as Chinese, Japanese, and Thai. It is usually treated as a sequence labelling task and solved with a language model, e.g. an n-gram model or an RNN. However, the low capacity of n-gram and RNN models limits their performance. This paper introduces a new solution named PERT, which stands for bidirectional Pinyin Encoder Representations from Transformers. It achieves a significant performance improvement over the baselines. Furthermore, we combine PERT with an n-gram model under a Markov framework, which improves performance further. Lastly, an external lexicon is incorporated into PERT to resolve the OOD issue of IME.
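To make the idea of combining a per-position model with an n-gram model under a Markov framework more concrete, the following is a minimal sketch, not the paper's actual method. It assumes an HMM-style decoder in which per-position character scores (a hypothetical stand-in for PERT's outputs) act as emission scores and a bigram table acts as the transition model, decoded with Viterbi. The function name `viterbi_p2c`, the toy candidate lists, and all probabilities are illustrative assumptions.

```python
import math

def viterbi_p2c(pinyin, candidates, emission, bigram, floor=math.log(1e-6)):
    """Return the best character string for a list of pinyin syllables.

    candidates: dict mapping a pinyin syllable to its candidate characters
    emission:   function (position, char) -> log-score (stand-in for a
                per-position model such as PERT; an assumption here)
    bigram:     dict mapping (prev_char, char) -> log-probability
    floor:      log-probability used for unseen bigrams
    """
    # best[c] = (score of the best path ending in c, that path)
    best = {c: (emission(0, c), [c]) for c in candidates[pinyin[0]]}
    for i in range(1, len(pinyin)):
        nxt = {}
        for c in candidates[pinyin[i]]:
            # Pick the predecessor that maximizes path score + transition.
            prev = max(best, key=lambda p: best[p][0] + bigram.get((p, c), floor))
            score = best[prev][0] + bigram.get((prev, c), floor) + emission(i, c)
            nxt[c] = (score, best[prev][1] + [c])
        best = nxt
    return "".join(max(best.values(), key=lambda v: v[0])[1])

# Toy data (all values are made up for illustration).
candidates = {"bei": ["北", "被"], "jing": ["京", "经"]}
emission_table = {
    (0, "北"): math.log(0.5), (0, "被"): math.log(0.5),
    (1, "京"): math.log(0.4), (1, "经"): math.log(0.6),
}
bigram = {("北", "京"): math.log(0.9), ("被", "经"): math.log(0.1)}

result = viterbi_p2c(
    ["bei", "jing"], candidates,
    lambda i, c: emission_table[(i, c)], bigram,
)
print(result)  # 北京
```

In this toy case the emission scores alone slightly prefer 经 at the second position, but the bigram transition makes 北京 the best overall path, which illustrates why adding an n-gram model on top of per-position scores can help.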