The Tsetlin Machine (TM) architecture has recently demonstrated effectiveness in Machine Learning (ML), particularly within Natural Language Processing (NLP). It has been used to construct word embeddings from conjunctive propositional clauses, significantly improving our ability to understand and interpret machine-derived decisions. The previous approach computed word embeddings over a sequence of input words, consolidating the information into a single cohesive representation; however, it encounters scalability challenges as the input size increases. In this study, we introduce a novel approach that incorporates two-phase training to discover contextual embeddings of input sequences. Specifically, this method first encapsulates the knowledge of each word in the dataset's vocabulary and then constructs embeddings for sequences of input words from the extracted knowledge. This technique not only enables a scalable model design but also preserves interpretability. Our experimental results show that the proposed method achieves competitive performance compared with previous approaches and promising results against human-generated benchmarks. Furthermore, we applied the proposed approach to sentiment analysis on the IMDB dataset, where the TM embedding and the TM classifier, together with other interpretable classifiers, provide a transparent end-to-end solution with competitive performance.