Automated classification for open-ended questions with BERT

Manual coding of text data from open-ended questions into different categories is time-consuming and expensive. Automated coding uses statistical/machine learning to train on a small subset of manually coded text answers. Recently, pre-training a general language model on vast amounts of unrelated data and then adapting the model to the specific application has proven effective in natural language processing. Using two data sets, we empirically investigate whether BERT, the currently dominant pre-trained language model, is more effective at automated coding of answers to open-ended questions than other non-pre-trained statistical learning approaches. First, we found that fine-tuning the pre-trained BERT parameters is essential, as otherwise BERT is not competitive. Second, we found that fine-tuned BERT barely beats the non-pre-trained statistical learning approaches in terms of classification accuracy when trained on 100 manually coded observations. However, BERT's relative advantage increases rapidly when more manually coded observations (e.g. 200-400) are available for training. We conclude that for automatically coding answers to open-ended questions, BERT is preferable to non-pre-trained models such as support vector machines and boosting.
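The automated-coding workflow described above (train on a small manually coded subset, then assign categories to the remaining answers) can be sketched as follows. This is only an illustration under stated assumptions: it uses a bag-of-words multinomial naive Bayes classifier as a simple stand-in, not BERT, support vector machines, or boosting, and the answers, category names, and function names are hypothetical toy examples rather than data or code from the study.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_naive_bayes(answers, codes):
    """Fit a multinomial naive Bayes model on manually coded answers."""
    class_counts = Counter(codes)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, code in zip(answers, codes):
        tokens = tokenize(text)
        word_counts[code].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the most probable category for one uncoded answer."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_code, best_score = None, -math.inf
    for code, n in class_counts.items():
        score = math.log(n / total)  # log prior of the category
        denom = sum(word_counts[code].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0)
                score += math.log((word_counts[code][tok] + 1) / denom)
        if score > best_score:
            best_code, best_score = code, score
    return best_code


def accuracy(model, answers, codes):
    """Share of answers whose predicted category matches the manual code."""
    hits = sum(predict(model, a) == c for a, c in zip(answers, codes))
    return hits / len(codes)


# Hypothetical toy data: a small manually coded subset ...
coded_answers = [
    "too expensive for my budget",
    "the price is too high",
    "costs too much money",
    "staff were rude to me",
    "the service was slow and rude",
    "nobody answered my call",
]
coded_labels = ["price", "price", "price", "service", "service", "service"]

# ... and uncoded answers to be classified automatically.
uncoded = ["price too high for me", "rude staff and slow service"]

model = train_naive_bayes(coded_answers, coded_labels)
predictions = [predict(model, a) for a in uncoded]
print(predictions)
```

To mimic the comparison described in the abstract, one would repeat the training step with coded subsets of different sizes (for example 100, 200, and 400 answers) and compare held-out accuracy across models; the toy data here is far too small for that.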
