The categorization of massive e-commerce data is a crucial, well-studied task that is prevalent in industrial settings. In this work, we aim to improve an existing product categorization model that is already in use by a major web company, serving multiple applications. At its core, the product categorization model is a text classification model that takes a product title as input and outputs the most suitable category out of thousands of available candidates. On closer inspection, we found inconsistencies in the labeling of similar items. For example, minor modifications of the product title pertaining to colors or measurements substantially changed the model's output. This phenomenon can negatively affect downstream recommendation or search applications, leading to a suboptimal user experience. To address this issue, we propose a new framework for consistent text categorization. Our goal is to improve the model's consistency while maintaining its production-level performance. We use a semi-supervised approach for data augmentation and present two different methods for utilizing unlabeled samples. One method relies directly on existing catalogs, while the other uses a generative model. We compare the pros and cons of each approach and present our experimental results.
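To make the notion of consistency concrete, the following is a minimal sketch of one way it could be measured: the fraction of groups of near-identical titles (differing only in color or size) for which a classifier returns a single category. The function `consistency_rate`, the keyword-based `toy_classifier`, and the example titles are all hypothetical illustrations and are not the metric, model, or data used in this work.

```python
from typing import Callable, List


def consistency_rate(classify: Callable[[str], str],
                     groups: List[List[str]]) -> float:
    """Fraction of groups whose titles all receive the same predicted category.

    Each group holds titles of similar items, e.g. the same product
    in different colors or sizes (an assumption of this sketch).
    """
    if not groups:
        raise ValueError("at least one group of titles is required")
    consistent = sum(
        1 for titles in groups
        if len({classify(title) for title in titles}) == 1
    )
    return consistent / len(groups)


def toy_classifier(title: str) -> str:
    """Hypothetical stand-in for a real model: a deliberately brittle keyword rule."""
    lowered = title.lower()
    if "red" in lowered.split():
        return "Clothing"   # the color token wrongly dominates the decision
    if "shoe" in lowered:
        return "Footwear"
    return "Other"


# Hypothetical variant groups: titles within a group differ only in color.
groups = [
    ["running shoe blue size 9", "running shoe red size 9"],
    ["leather wallet black", "leather wallet brown"],
]

rate = consistency_rate(toy_classifier, groups)
print(f"consistency rate: {rate:.2f}")
```

In this toy setting the first group is labeled inconsistently because the color token changes the prediction, which is the kind of behavior the proposed framework aims to reduce.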