Is the tokenization that humans prefer also the one that machine-learning (ML) models prefer? This study examines the relationship between tokenization preferred by humans (in terms of appropriateness and readability) and tokenization preferred by ML models (in terms of performance on an NLP task). We tokenize the question texts of a Japanese commonsense question-answering dataset with six different tokenizers and compare the performance of human annotators and ML models. We further analyze the relationships among the answer performance of humans and ML models, the appropriateness of the tokenization for humans, and human response time to the questions. Our quantitative results show that the tokenizations preferred by humans and by ML models are not necessarily the same. The results also suggest that existing tokenization methods that use language models could be a good compromise for both humans and ML models.