Does the dominant approach to learning representations (as a side effect of optimizing an expected cost for a single training distribution) remain a good approach when we are dealing with multiple distributions? Our thesis is that such scenarios are better served by representations that are richer than those obtained with a single optimization episode. We support this thesis with simple theoretical arguments and with experiments using an apparently naïve ensembling technique: concatenating the representations obtained from multiple training episodes that use the same data, model, algorithm, and hyper-parameters, but different random seeds. These independently trained networks perform similarly. Yet, in a number of scenarios involving new distributions, the concatenated representation performs substantially better than an equivalently sized network trained in a single run. This shows that the representations constructed by multiple training episodes are in fact different. Although their concatenation carries little additional information about the training task under the training distribution, it becomes substantially more informative when tasks or distributions change. Meanwhile, a single training episode is unlikely to yield such a redundant representation because the optimization process has no reason to accumulate features that do not incrementally improve the training performance.
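The concatenation technique described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the experimental setup: it uses a toy one-hidden-layer ReLU regressor trained with plain gradient descent in NumPy, synthetic data, and a least-squares linear probe. The names `train_mlp`, `concat_features`, and `linear_probe` are hypothetical helpers introduced here.

```python
import numpy as np


def train_mlp(x, y, hidden, seed, steps=100, lr=0.05):
    """One training episode: fit a one-hidden-layer ReLU regressor by
    full-batch gradient descent and return the first-layer weights.
    Only `seed` (the random initialization) differs between episodes."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w1 = rng.standard_normal((d, hidden)) / np.sqrt(d)
    w2 = rng.standard_normal((hidden, 1)) / np.sqrt(hidden)
    for _ in range(steps):
        h = np.maximum(x @ w1, 0.0)        # hidden representation, shape (n, hidden)
        err = h @ w2 - y                   # prediction error, shape (n, 1)
        grad_w2 = h.T @ err / n            # gradient of 0.5 * mean squared error
        grad_w1 = x.T @ ((err @ w2.T) * (h > 0)) / n
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return w1


def concat_features(first_layers, x):
    """Concatenate the frozen representations of several runs along the feature axis."""
    return np.concatenate([np.maximum(x @ w, 0.0) for w in first_layers], axis=1)


def linear_probe(feats, y):
    """Least-squares linear probe (with bias) fitted on frozen features."""
    design = np.hstack([feats, np.ones((feats.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# Toy data (an assumption for illustration): a training task and a
# different task defined on the same inputs.
rng = np.random.default_rng(0)
x = rng.standard_normal((128, 8))
y_train = np.sin(x[:, :1])
y_new = np.cos(x[:, 1:2])

# Same data, model, algorithm, and hyper-parameters; only the seed changes.
runs = [train_mlp(x, y_train, hidden=16, seed=s) for s in range(4)]
feats = concat_features(runs, x)           # shape (128, 4 * 16)
probe = linear_probe(feats, y_new)         # probe for the new task, shape (65, 1)
```

The matching baseline in this toy setting would be a single call such as `train_mlp(x, y_train, hidden=64, seed=0)`, probed the same way on held-out data from the new task. The sketch makes no claim about which of the two comes out ahead.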