Users of Augmentative and Alternative Communication (AAC) may write letter-by-letter via an interface that uses a character language model. However, most state-of-the-art large pretrained language models predict subword tokens of variable length. We investigate how to practically use such models to make accurate and efficient character predictions. We fine-tune models using a large dataset of sentences we curated in which each sentence is rated according to how useful it might be for spoken or written AAC communication. We find that using an algorithm to produce character predictions from a subword large language model provides more accurate predictions than adding a classification layer or using a byte-level model. We also find that our domain adaptation procedure is effective at improving model performance on simple, conversational text.
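The core idea of deriving character predictions from a subword model can be illustrated with a toy sketch: sum the probability mass of every candidate subword token that begins with a given character. This is a minimal, hypothetical simplification (the paper's actual algorithm is not reproduced here), and the function name and toy distribution are illustrative assumptions.

```python
from collections import defaultdict

def char_probs(token_probs):
    """Aggregate a next-subword-token distribution into a
    next-character distribution by summing the mass of every
    token that starts with that character.

    `token_probs` maps token strings to probabilities.
    (Hypothetical sketch; not the paper's exact algorithm.)
    """
    out = defaultdict(float)
    for tok, p in token_probs.items():
        if tok:  # skip empty tokens
            out[tok[0]] += p
    return dict(out)

# Toy next-token distribution from an imagined subword LM.
dist = {"the": 0.4, "thus": 0.1, "a": 0.3, "an": 0.2}
probs = char_probs(dist)
print(probs)  # 't' accumulates 0.5, 'a' accumulates 0.5
```

In practice the token distribution would come from a fine-tuned large language model, and the marginalization must also handle tokens consumed mid-prediction, but the aggregation step above captures why no separate character-level model is required.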