Consistent Text Categorization using Data Augmentation in e-Commerce

The categorization of massive e-Commerce data is a crucial, well-studied task that is prevalent in industrial settings. In this work, we aim to improve an existing product categorization model that is already in use by a major web company, serving multiple applications. At its core, the product categorization model is a text classification model that takes a product title as input and outputs the most suitable category out of thousands of available candidates. Upon closer inspection, we found inconsistencies in the labeling of similar items. For example, minor modifications of the product title pertaining to colors or measurements substantially changed the model's output. This phenomenon can negatively affect downstream recommendation or search applications, leading to a sub-optimal user experience. To address this issue, we propose a new framework for consistent text categorization. Our goal is to improve the model's consistency while maintaining its production-level performance. We use a semi-supervised approach for data augmentation and present two different methods for utilizing unlabeled samples. One method relies directly on existing catalogs, while the other uses a generative model. We compare the pros and cons of each approach and present our experimental results.
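The inconsistency described above can be illustrated with a small check. The abstract does not specify how consistency is measured, so the following is only a minimal sketch under stated assumptions: the color list, the helper names `color_variants` and `consistency`, and the deliberately brittle `toy_classify` stand-in are all hypothetical and are not the production model or the authors' metric. The sketch swaps the color word in a title and reports the fraction of variants that keep the original predicted category.

```python
from typing import Callable, List, Sequence

# Hypothetical color vocabulary used to build title variants (illustrative assumption).
COLORS: Sequence[str] = ("red", "blue", "black", "white")


def color_variants(title: str, colors: Sequence[str] = COLORS) -> List[str]:
    """Return copies of `title` in which each known color word is swapped for another color."""
    tokens = title.split()
    variants = []
    for i, tok in enumerate(tokens):
        if tok.lower() in colors:
            for c in colors:
                if c != tok.lower():
                    variants.append(" ".join(tokens[:i] + [c] + tokens[i + 1:]))
    return variants


def consistency(classify: Callable[[str], str], title: str, variants: Sequence[str]) -> float:
    """Fraction of variants that receive the same category as the original title."""
    if not variants:
        return 1.0  # nothing to disagree with
    base = classify(title)
    return sum(classify(v) == base for v in variants) / len(variants)


def toy_classify(title: str) -> str:
    """Deliberately brittle stand-in classifier (hypothetical, not the real model)."""
    words = title.lower().split()
    if "red" in words:
        return "Toys"  # spurious dependence on a color word
    return "Apparel" if "shirt" in words else "Other"


if __name__ == "__main__":
    title = "blue cotton shirt size M"
    variants = color_variants(title)
    print(variants)
    print(consistency(toy_classify, title, variants))
```

In this toy setup, a single color swap is enough to flip the predicted category for one of the three variants, which is the kind of behavior a consistency-oriented augmentation scheme is meant to reduce.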
