This paper is a theoretical analysis showing that the ridge regularization strength that is optimal for the source task is almost never optimal for transfer performance. Interestingly, in high-SNR regimes (low noise) the regularization that is best for pre-training is higher than the source task's own optimum, while in low-SNR regimes (high noise) it is better to regularize less than you would if you were only optimizing for the source task.
Although the proofs are in the setting of L2-SP ridge regression, experiments with an MLP on MNIST and a CNN on CIFAR-10 suggest the SNR-regularization relationship persists in non-linear networks.
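To make the setting concrete, here is a minimal sketch (not the paper's code) of the two-stage setup it analyzes: ordinary ridge on the source task, then L2-SP ridge on the target task, i.e. a penalty pulling the target weights toward the pre-trained ones rather than toward zero. The names `lam_src`, `lam_tgt`, and the toy data are illustrative assumptions; the paper's claim is that the `lam_src` minimizing source error generally differs from the one minimizing downstream error.

```python
import numpy as np

def ridge(X, y, lam, w_ref=None):
    """Closed-form ridge solution; with w_ref it becomes the L2-SP penalty
    lam * ||w - w_ref||^2, where w_ref are the pre-trained weights."""
    d = X.shape[1]
    if w_ref is None:
        w_ref = np.zeros(d)
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y + lam * w_ref
    return np.linalg.solve(A, b)

# Toy data: a shared true direction plus additive noise (the noise level is the SNR knob).
rng = np.random.default_rng(0)
d, n_src, n_tgt, noise = 20, 200, 30, 0.5
w_true = rng.normal(size=d)
X_src = rng.normal(size=(n_src, d)); y_src = X_src @ w_true + noise * rng.normal(size=n_src)
X_tgt = rng.normal(size=(n_tgt, d)); y_tgt = X_tgt @ w_true + noise * rng.normal(size=n_tgt)

lam_src, lam_tgt = 1.0, 5.0                            # illustrative regularization strengths
w_pre = ridge(X_src, y_src, lam=lam_src)               # ridge pre-training on the source task
w_fin = ridge(X_tgt, y_tgt, lam=lam_tgt, w_ref=w_pre)  # L2-SP fine-tuning on the target task
print(np.linalg.norm(w_fin - w_true))                  # target-side parameter error
```

Sweeping `lam_src` at fixed `noise` and comparing source error against the final target error is one way to reproduce the qualitative effect the paper describes.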