Is the tokenization that humans prefer also the one that machine-learning (ML) models prefer? This study examines the relationship between tokenization preferred by humans (in terms of appropriateness and readability) and tokenization preferred by ML models (in terms of performance on an NLP task). We tokenize the question texts of a Japanese commonsense question-answering dataset with six different tokenizers and compare the performance of human annotators and ML models. We further analyze the relationships among the answer performance of humans and ML models, the appropriateness of the tokenization as judged by humans, and human response time to the questions. Our quantitative results show that the tokenizations preferred by humans and by ML models are not necessarily the same. The results also suggest that existing tokenization methods based on language models could be a good compromise for both humans and ML models.