Social media offer plenty of information for market research aimed at meeting customer requirements. One way of conducting this research is for a domain expert to gather user-generated content and categorize it into a complex, fine-grained class structure. In many such cases, little data meets complex annotations, and it is not yet fully understood how this can be leveraged successfully for classification. We examine the classification accuracy of expert labels when used with a) many fine-grained classes and b) few abstract classes. For scenario b), we compare abstract class labels given by the domain expert, which serve as a baseline, with labels obtained by automatic hierarchical clustering. We compare this to another baseline in which the entire class structure is given by a completely unsupervised clustering approach. This work can thus serve as an example of how complex expert annotations are potentially beneficial and how they can be utilized most effectively for opinion mining in highly specific domains. Across a range of techniques and experiments, we find that automated class abstraction approaches, in particular the unsupervised approach, perform remarkably well against the domain expert baseline on text classification tasks. This has the potential to inspire opinion mining applications that support market researchers in practice, and to inspire fine-grained automated content analysis on a large scale.