The behavior of a random walk in a high dimensional space can be counter-intuitive. If you take the random walk trajectory and then perform principal components analysis on it, it turns out more than half of the variance is along a single direction. More than 80% is along the first two principal components.
To make matters even more surprising, if you project the random walk trajectory down into these PCA subspaces, it no longer looks random at all. Instead the trajectory traces a Lissajous curve. (For example see figure 1 of this paper: https://proceedings.neurips.cc/paper/2018/file/7a576629fef88...)
They say "these results are completely general for any probability distribution with zero mean and a finite covariance matrix with rank much larger than the number of steps". It's not clear to me whether that condition means the number of steps is much lower than the dimension of the random walk space, or whether the probability distribution needs to be concentrated into a smaller number of dimensions to begin with. In the latter case the result is much less shocking.
The condition is the former. The probability distribution spans the full dimensionality of the space. Basically, the result will hold for an infinite number of dimensions and a finite number of steps. But it will also hold if you take both the number of steps and the dimensionality to infinity while holding the ratio N_steps / D constant with N_steps / D << 1.
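For anyone who wants to check the PCA claims numerically, here is a minimal sketch. The function name `random_walk_pca`, the Gaussian step distribution, and the sizes (100 steps in 5000 dimensions, so N_steps / D << 1) are my own assumptions for illustration, not something taken from the paper. It prints the share of variance in the first two components and how closely the first projection follows half a period of a cosine, which is the shape I would expect for the slowest Lissajous component.

```python
import numpy as np

def random_walk_pca(n_steps=100, n_dims=5000, seed=0):
    """Simulate a random walk and return PCA variance shares and projections."""
    rng = np.random.default_rng(seed)
    # Assumed step distribution: i.i.d. standard normal (zero mean, full rank).
    steps = rng.standard_normal((n_steps, n_dims))
    walk = np.cumsum(steps, axis=0)
    centered = walk - walk.mean(axis=0)
    # SVD of the centered trajectory; u * s gives the coordinates of every
    # time point along each principal component.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    variance_share = s**2 / np.sum(s**2)
    projections = u * s
    return variance_share, projections

share, proj = random_walk_pca()
print("share of PC1, PC2:", share[:2], "sum:", share[:2].sum())

# Compare the projection onto PC1 with half a period of a cosine.
t = np.arange(proj.shape[0])
half_cosine = np.cos(np.pi * (t + 0.5) / len(t))
corr = abs(np.corrcoef(proj[:, 0], half_cosine)[0, 1])
print("|correlation| of PC1 projection with a half cosine:", corr)
```

Plotting `proj[:, 0]` against `proj[:, 1]` should give the kind of curve shown in the paper's figure; shrinking `n_dims` toward `n_steps` is a quick way to see the regime where the condition no longer holds.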
Thank you for sharing. Learning about PCA subspaces and Lissajous curves wasn't originally on my agenda today.
> Therefore, despite the insanely large number of adjustable parameters, general solutions, that are meaningful and predictive, can be found by adding random walks around the objective landscape as a partial strategy in combination with gradient descent.
Are there methods that specifically apply this idea?
I guess this is a good explanation for why deep learning isn't just automatically impossible, because if local minima were everywhere then it would be impossible. But on the other hand, usually the goal isn't to add more and more parameters, it's to add just enough so that common features can be identified but not enough to "memorize the dataset." And to design an architecture that is flexible enough but still quite restricted, so that it can't represent just any function. And of course in many cases (especially when there's less data) it makes sense to manually design transformations from the high dimensional space to a lower dimensional one that contains less noise and can be modeled more easily.
The article feels connected to the manifold hypothesis, where the function we're modeling has some projection into a low dimensional space, making it possible to model. I could imagine a similar thing where if a potential function has lots of ridges, you can "glue it together" so all the level sets line up, and that corresponds with some lower dimensional optimization problem that's easier to solve. Really interesting and I found it super clearly written.
> Are there methods that specifically apply this idea?
Stochastic gradient descent is basically this (not exactly the same, but the core intuitions align IMO). Not exactly optimization, but Hamiltonian MCMC also seems highly related.
> I could imagine a similar thing where if a potential function has lots of ridges, you can "glue it together" so all the level sets line up, and that corresponds with some lower dimensional optimization problem that's easier to solve.
Excellent intuition, this is exactly the idea of HMC (as far as I recall); the concrete math behind this is (IIRC) a "fiber bundle".
> There is one chance in ten that the walker will take a positive or negative step along any given dimension at each time point.
This confused me a bit. To clarify: at each step, the random walker selects a dimension (with probability 1/10 for any given dimension), and then chooses a direction along that dimension (positive or negative, each with probability 1/2). There are 20 possible moves to choose from at any step.
Thanks for this. It goes back to the node connectivity graphs he shows just above that statement.
He is thinking about a random choice among the 20 edges branching out from each vertex.
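A minimal sketch of that step rule, assuming the 10-dimensional lattice from the quoted sentence. The names `lattice_moves` and `lattice_step` are made up for illustration. Each move changes exactly one coordinate by +1 or -1, so there are 2 × 10 = 20 moves, each with probability 1/10 × 1/2 = 1/20.

```python
import random

def lattice_moves(n_dims=10):
    """Enumerate every unit move: +1 or -1 along exactly one axis."""
    moves = []
    for axis in range(n_dims):
        for sign in (+1, -1):
            move = [0] * n_dims
            move[axis] = sign
            moves.append(tuple(move))
    return moves

def lattice_step(n_dims=10, rng=random):
    """Pick an axis uniformly, then a direction uniformly."""
    axis = rng.randrange(n_dims)   # probability 1/n_dims per axis
    sign = rng.choice([-1, 1])     # probability 1/2 per direction
    step = [0] * n_dims
    step[axis] = sign
    return step

moves = lattice_moves(10)
print(len(moves))        # number of edges leaving each vertex
print(lattice_step(10))  # one randomly chosen unit step
```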
I thought multidimensional random walkers would make random choices on all dimensions, so:
step = [random.choice([-1,0,1]) for _d in range(n_dimensions)]
At least this is how I did 2D random walks, as this allows for diagonal steps (with the downside that the walker travels longer steps in that direction).
The common definition for random walks moves only by unit vectors. Unfortunately, the information on Wikipedia is somewhat limited. The book "Random Walk: A Modern Introduction" (2010) by Gregory Lawler describes things in the first chapter, and is available online for free [1].
[1] https://www.math.uchicago.edu/~lawler/srwbook.pdf
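To make the difference between the two conventions concrete, here is a small sketch for the 2D case. It enumerates the moves you get when every axis independently picks from -1, 0, 1, and tallies their squared lengths. The variable names are only illustrative.

```python
from collections import Counter
from itertools import product

n_dims = 2
# Every axis independently chooses -1, 0 or +1.
all_axis_moves = list(product([-1, 0, 1], repeat=n_dims))
# Tally moves by squared Euclidean length.
squared_lengths = Counter(sum(c * c for c in move) for move in all_axis_moves)

print(len(all_axis_moves))              # 3 ** n_dims possible moves
print(sorted(squared_lengths.items()))  # (squared length, number of moves)
```

In 2D this gives one zero-length move (the walker stays put), four unit moves, and four diagonal moves of length sqrt(2), whereas the unit-vector definition only ever produces the four unit moves.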
> On the other hand, a so-called mountain peak would be a 5 surrounded by 4’s or lower. The odds for having this happen in 10D are 0.2*(1-0.8^10) = 0.18. Then the total density of mountain peaks, in a 10D hyperlattice with 5 potential values, is only 18%.
I believe the odds are actually
0.2 (odds of it being a 5) ×
0.8^10 (odds of each of the neighbors being ∈ {1,2,3,4})
which is ~0.021 or around 2%. This makes much more sense, since peaks making up 18% of the nodes wouldn't sound rare at all.
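The arithmetic can be checked in a few lines. This sketch assumes each site value is drawn independently and uniformly from 1 to 5, which is how I read the quoted passage. The third number is my own addition: a site on a 10D hyperlattice has 20 nearest neighbors (the same 20 moves discussed elsewhere in this thread), so if all of them have to be lower, the exponent would be 20 rather than 10.

```python
p_five = 1 / 5    # probability that a site holds the top value, 5
p_lower = 4 / 5   # probability that a given neighbor holds 1 to 4

quoted = p_five * (1 - p_lower**10)   # formula as written in the article
corrected = p_five * p_lower**10      # 5 at the site, 10 neighbors lower
all_neighbors = p_five * p_lower**20  # assumption: all 20 neighbors lower

print(f"quoted:        {quoted:.4f}")
print(f"corrected:     {corrected:.4f}")
print(f"20 neighbors:  {all_neighbors:.4f}")
```

Either way the peaks come out rare, on the order of 2% or less, rather than 18%.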
Tangentially related:
https://www.youtube.com/watch?v=iH2kATv49rc
Turns out there is a very interesting theorem by Polya that separates 1- and 2-dimensional random walks from higher-dimensional ones. I thought I'd link this video, because it's so well done.
Love this quote from Shizuo Kakutani to describe Polya's result: "A drunk man will find his way home, but a drunk bird may get lost forever."
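A rough simulation sketch of the drunk man versus the drunk bird, assuming the simple unit-step lattice walk. The function name `return_fraction`, the step count, and the trial count are arbitrary choices of mine. A finite simulation can only count returns within `n_steps`, so it underestimates the true return probability, noticeably so in 2D where returns are certain but slow.

```python
import numpy as np

def return_fraction(n_dims, n_steps=1000, n_trials=1000, seed=0):
    """Fraction of walks that revisit the origin within n_steps."""
    rng = np.random.default_rng(seed)
    # For every trial and time step, pick one axis and one direction.
    axes = rng.integers(0, n_dims, size=(n_trials, n_steps))
    signs = rng.choice([-1, 1], size=(n_trials, n_steps))
    steps = np.zeros((n_trials, n_steps, n_dims), dtype=np.int32)
    rows, cols = np.indices((n_trials, n_steps))
    steps[rows, cols, axes] = signs
    # Positions after each step; a return means all coordinates are zero.
    positions = steps.cumsum(axis=1)
    returned = (positions == 0).all(axis=2).any(axis=1)
    return returned.mean()

fractions = {d: return_fraction(d) for d in (1, 2, 3)}
for d, frac in fractions.items():
    print(f"{d}D: returned to origin in {frac:.0%} of walks")
```

The 1D fraction should sit close to 1, the 2D fraction should keep creeping upward as `n_steps` grows, and the 3D fraction should level off well below 1, which is the bird getting lost.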