Multi-label classification of open-ended questions with BERT

Open-ended questions in surveys are valuable because they do not constrain the respondent's answer, thereby avoiding biases. However, answers to open-ended questions are text data, which are harder to analyze. Traditionally, answers were classified manually according to a coding manual. Most of the effort to automate coding has gone into the easier problem of single-label prediction, where each answer is assigned exactly one code. However, open-ends that require multi-label classification, i.e., answers that are assigned multiple codes, occur frequently. This paper focuses on multi-label classification of text answers to open-ended questions in social science surveys. We evaluate the performance of the transformer-based architecture BERT for the German language in comparison to traditional multi-label algorithms (Binary Relevance, Label Powerset, ECC) on a German social science survey, the GLES Panel (N=17,584, 55 labels). We find that classification with BERT (forcing at least one label) has the smallest 0/1 loss (13.1%) of the methods considered; the traditional algorithms range from 18.9% to 21.6%. As expected, it is much easier to correctly predict answer texts that correspond to a single label (7.1% loss) than those that correspond to multiple labels (about 50% loss). Because BERT predicts zero labels for only 1.5% of the answers, forcing at least one label, while recommended, ultimately does not lower the 0/1 loss by much. Our work has important implications for social scientists: 1) We have shown that multi-label classification with BERT works for German-language open-ends. 2) For mildly multi-label classification tasks, the loss now appears small enough to allow for fully automatic classification (as compared to semi-automatic approaches). 3) Multi-label classification with BERT requires only a single model, whereas the leading competitor, ECC, iterates through individual single-label predictions.
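
To make the evaluation criterion concrete, the following minimal sketch illustrates the 0/1 loss (an answer counts as wrong unless its predicted label set matches the true set exactly) and one possible way of forcing at least one label. The abstract does not state how the forcing is implemented; the 0.5 threshold, the fallback to the most probable label, and the toy probabilities below are assumptions made for illustration only, not the study's procedure or data.

```python
from typing import List, Sequence


def predict_labels(probs: Sequence[float],
                   threshold: float = 0.5,
                   force_one: bool = True) -> List[int]:
    """Turn per-label probabilities into a 0/1 label vector.

    Assumption: a label is assigned when its probability reaches `threshold`.
    Assumption: if no label qualifies and `force_one` is set, the single
    most probable label is assigned instead.
    """
    labels = [1 if p >= threshold else 0 for p in probs]
    if force_one and sum(labels) == 0:
        best = max(range(len(probs)), key=lambda i: probs[i])
        labels[best] = 1
    return labels


def zero_one_loss(y_true: Sequence[Sequence[int]],
                  y_pred: Sequence[Sequence[int]]) -> float:
    """Share of answers whose predicted label vector differs from the truth."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(1 for t, p in zip(y_true, y_pred) if list(t) != list(p))
    return wrong / len(y_true)


# Hypothetical toy data: 3 answers, 4 labels (not taken from the GLES Panel).
probs = [
    [0.9, 0.1, 0.2, 0.05],  # one clear label
    [0.3, 0.4, 0.2, 0.1],   # nothing reaches the threshold
    [0.6, 0.7, 0.1, 0.2],   # two labels reach the threshold
]
truth = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]

unforced = [predict_labels(p, force_one=False) for p in probs]
forced = [predict_labels(p, force_one=True) for p in probs]

loss_unforced = zero_one_loss(truth, unforced)
loss_forced = zero_one_loss(truth, forced)
print(f"0/1 loss without forcing: {loss_unforced:.3f}")
print(f"0/1 loss with forcing:    {loss_forced:.3f}")
```

In this toy example, forcing rescues only the answer that would otherwise receive no label. This mirrors the observation above that forcing can help only on the small share of answers for which the model predicts zero labels.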
