This paper introduces semantic features as a candidate conceptual framework for building inherently interpretable neural networks. A proof-of-concept model for an informative subproblem of MNIST consists of 4 such layers with a total of 5K learnable parameters. The model is well-motivated, inherently interpretable, requires little hyperparameter tuning, and achieves human-level adversarial test accuracy without any form of adversarial training. These results, together with the general nature of the approach, warrant further research on semantic features. The code is available at https://github.com/314-Foundation/white-box-nn