This work presents a novel method for determining representational structure, building on the existing Spotlight Resonance method. The new tool is used to study how discrete representations can emerge and organise in autoencoder models, through a controlled ablation study that alters only the activation function. With this technique, the hypothesis that function-driven symmetries can act as implicit inductive biases on representations is tested. Representations are found to discretise when the activation function is defined through a discrete, algebraic permutation-equivariant symmetry; in contrast, they remain continuous under a continuous, algebraic orthogonal-equivariant definition. This confirms the hypothesis that the symmetries of network primitives can carry unintended inductive biases, leading to task-independent artefactual structures in representations. The discrete symmetry of contemporary functional forms is shown to be a strong predictor of symmetry-organised discrete representations emerging from otherwise continuous distributions: a quantisation effect. Given such unintended consequences, this motivates a further reassessment of the functional forms in common usage. Moreover, the results support a general causal model for one mode in which discrete representations may form, which could constitute a prerequisite for downstream interpretability phenomena, including grandmother neurons, discrete coding schemes, general linear features and a type of Superposition. Hence, this tool and the proposed mechanism for the influence of functional form on representations may provide insights for interpretability research. Finally, preliminary results indicate that quantisation of representations correlates with a measurable increase in reconstruction error, reinforcing previous conjectures that this collapse can be detrimental.
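The distinction between the two symmetry classes can be made concrete with a small numerical check. The sketch below is illustrative only and is not the paper's implementation: `elementwise_tanh` stands in for a standard elementwise activation, while `radial_tanh` is a hypothetical norm-based activation chosen here as one simple example of an orthogonal-equivariant function. The helper `equivariance_gap` (also an assumed name) measures how far an activation f is from commuting with a matrix M, that is, how far f(Mx) is from M f(x).

```python
import math

def elementwise_tanh(x):
    """Standard elementwise activation: each coordinate is transformed separately."""
    return [math.tanh(v) for v in x]

def radial_tanh(x):
    """Hypothetical norm-based activation (an assumption for illustration):
    rescales the whole vector by tanh(|x|) / |x|, leaving its direction unchanged."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        return list(x)
    scale = math.tanh(norm) / norm
    return [scale * v for v in x]

def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def equivariance_gap(f, m, x):
    """Largest coordinate difference between f(Mx) and M f(x).
    A gap of (numerically) zero means f commutes with M at this x."""
    left = f(matvec(m, x))
    right = matvec(m, f(x))
    return max(abs(a - b) for a, b in zip(left, right))

x = [0.3, -1.2, 2.0]

# A permutation matrix: swaps the first two coordinates (discrete symmetry).
swap = [[0, 1, 0],
        [1, 0, 0],
        [0, 0, 1]]

# A rotation by pi/5 in the first two coordinates (continuous orthogonal symmetry).
theta = math.pi / 5
rot = [[math.cos(theta), -math.sin(theta), 0.0],
       [math.sin(theta),  math.cos(theta), 0.0],
       [0.0,              0.0,             1.0]]

for name, f in [("elementwise_tanh", elementwise_tanh), ("radial_tanh", radial_tanh)]:
    print(name,
          "| permutation gap:", equivariance_gap(f, swap, x),
          "| rotation gap:", equivariance_gap(f, rot, x))
```

Under these assumptions, the elementwise function commutes with the permutation but not with the rotation, whereas the norm-based function commutes with both up to floating-point error. This only illustrates the symmetry property itself; it says nothing about whether trained representations discretise, which is the empirical question the study addresses.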