The categorization of massive e-commerce data is a crucial, well-studied task that is prevalent in industrial settings. In this work, we aim to improve an existing product categorization model that is already in use by a major web company, serving multiple applications. At its core, the product categorization model is a text classification model that takes a product title as input and outputs the most suitable category out of thousands of available candidates. Upon closer inspection, we found inconsistencies in the labeling of similar items. For example, minor modifications of the product title pertaining to colors or measurements substantially changed the model's output. This phenomenon can negatively affect downstream recommendation or search applications, leading to a sub-optimal user experience. To address this issue, we propose a new framework for consistent text categorization. Our goal is to improve the model's consistency while maintaining its production-level performance. We use a semi-supervised approach for data augmentation and present two different methods for utilizing unlabeled samples. One method relies directly on existing catalogs, while the other uses a generative model. We compare the pros and cons of each approach and present our experimental results.
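To make the notion of consistency concrete, the following minimal sketch measures how often a classifier keeps its predicted category when only a color word in the title changes. It is an illustration under stated assumptions and not the framework or the metric used in this work: the color list, the `color_variants` and `consistency` helpers, and the deliberately brittle `toy_classifier` are all hypothetical stand-ins for the production model.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical attribute vocabulary; a real catalog would need a far richer list.
COLORS = ["red", "blue", "black", "white", "green"]


def color_variants(title: str) -> List[str]:
    """Return copies of `title` with each color word swapped for every other color."""
    variants = []
    for color in COLORS:
        pattern = re.compile(rf"\b{color}\b", flags=re.IGNORECASE)
        if pattern.search(title):
            for other in COLORS:
                if other != color:
                    variants.append(pattern.sub(other, title))
    return variants


def consistency(classify: Callable[[str], str], titles: Iterable[str]) -> float:
    """Fraction of color-swapped variants that keep the category of the original title."""
    same = total = 0
    for title in titles:
        base = classify(title)
        for variant in color_variants(title):
            total += 1
            same += classify(variant) == base
    # With no variants to compare, report perfect consistency by convention.
    return same / total if total else 1.0


def toy_classifier(title: str) -> str:
    """Deliberately brittle stand-in for a real model (illustration only)."""
    lowered = title.lower()
    if "black" in lowered:
        return "Electronics"  # spurious dependence on a color word
    return "Apparel" if "shirt" in lowered else "Home"


titles = ["Red cotton shirt size M", "Blue ceramic mug 350 ml"]
score = consistency(toy_classifier, titles)
print(f"consistency under color swaps: {score:.2f}")
```

In this toy setting, a score below 1.0 indicates that a change of color alone is enough to move some titles to a different category, which is the kind of inconsistency described above. The same measurement could in principle be repeated for measurement tokens such as sizes or volumes.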