Free-form text is still one of the common ways in which data is recorded in real-world settings, such as legal proceedings and medical records. Because of that, there have been significant efforts in natural language processing to convert these texts into a structured format that standard machine learning methods can then exploit. One of the most popular methods to embed text into a vector representation is the Contrastive Language-Image Pre-training model (CLIP), which was trained on both images and text. Although the representations computed by CLIP have been very successful in zero-shot and few-shot learning problems, they still struggle when applied to a particular domain. In this work, we use a fuzzy rule-based classification system along with some standard text processing techniques to map some of our features of interest to the space created by a CLIP model. Then, we discuss the rules and associations obtained and the importance of each feature considered. We apply this approach in two different data domains, clinical reports and film reviews, and compare the results obtained for each domain individually and for both combined. Finally, we discuss the limitations of this approach and how it could be further improved.