Dropout is a representative regularization technique that stochastically deactivates hidden units during training to mitigate overfitting. At inference, however, the full network is executed densely, so dropout differs in both goal and mechanism from conditional computation, where the operations executed depend on the input. This paper presents DynamicGate-MLP, a single framework that satisfies both the regularization view and the conditional-computation view. Instead of a random mask, the proposed model learns gates that decide whether each unit (or block) is used, suppressing unnecessary computation while realizing sample-dependent execution that concentrates computation on the parts each input actually needs. To this end, we define continuous gate probabilities and, at inference time, derive a discrete execution mask from them to select an execution path. Training controls the compute budget through a penalty on expected gate usage and optimizes the discrete mask with a Straight-Through Estimator (STE). We evaluate DynamicGate-MLP on MNIST, CIFAR-10, Tiny-ImageNet, Speech Commands, and PBMC3k, comparing it with various MLP baselines and MoE-style variants. Compute efficiency is compared under a consistent criterion using gate activation ratios and a layer-weighted relative MAC metric, rather than wall-clock latency, which depends on hardware and backend kernels.
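To make the gating and STE description concrete, the following is a minimal PyTorch sketch of one possible realization. It assumes an input-conditioned gate network, a 0.5 threshold for the inference mask, and a mean-usage penalty; the names `DynamicGatedBlock` and `relative_macs` are illustrative and are not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DynamicGatedBlock(nn.Module):
    """Hypothetical gated MLP block: a small gate network maps the input
    to per-unit probabilities, a hard mask selects the execution path,
    and a straight-through estimator keeps training differentiable."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.fc = nn.Linear(d_in, d_hidden)
        self.gate = nn.Linear(d_in, d_hidden)  # input-conditioned gate logits

    def forward(self, x: torch.Tensor):
        p = torch.sigmoid(self.gate(x))   # continuous gate probabilities (per sample)
        hard = (p > 0.5).float()          # discrete execution mask used at inference
        mask = hard + p - p.detach()      # STE: forward uses hard, backward flows through p
        return torch.relu(self.fc(x)) * mask, p

def relative_macs(activation_ratios, layer_macs):
    """Layer-weighted relative MAC (assumed form): expected MACs under
    the gates divided by the dense model's MACs."""
    expected = sum(a * m for a, m in zip(activation_ratios, layer_macs))
    return expected / sum(layer_macs)

# Training-step sketch: a placeholder task loss plus a penalty on expected
# gate usage, which is the term that controls the compute budget.
block = DynamicGatedBlock(d_in=128, d_hidden=256)
x = torch.randn(32, 128)
h, p = block(x)
loss = h.pow(2).mean() + 1e-2 * p.mean()  # placeholder task loss + usage penalty
loss.backward()
```

For simplicity the sketch masks outputs after computing them; in an actual deployment, block-level gates would instead skip the gated computation entirely, which is what the relative MAC metric is meant to capture.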