Wait a second, they define the induced vector field (and consequently Lie bracket) in terms of batch-size 1 SGD:
> In particular, if x is a training example and L(x) is the per-example loss for the training example x, then this vector field is: v^(x)(θ) = -∇_θ L(x). In other words, for a specific training example, the arrows of the resulting vector field point in the direction that the parameters should be updated.
but for the MXResNet example:
> The optimizer is Adam, with the following parameters: lr = 5e-3, betas = (0.8, 0.999)
This changes the direction of the updates, such that I'm not completely sure the intuitive equivalence holds.
If it were just SGD with momentum, then the measured update directions would be a combination of the momentum vector and v1/v2, so {M + v1, M + v2} = {v1, M} + {M, v2} + {v1, v2}. The Lie bracket is no longer "just" a function of the model parameters and the training examples; it's now inherently path dependent.
For Adam, the parameter-wise normalization by the second norm will also slightly change the directions of the updates in a nonlinear way (thanks to the β2 term).
The interpretation is also strained with fancier optimizers like Muon; this uses both momentum and (approximate) SVD normalization, so I'm really not sure what to expect.
Lie brackets are bi-linear so whatever you do per example automatically carries over to sums, the bracket for the batch is just the pairwise brackets for elements in the batch, i.e. {a + b + c, d} = {a, d} + {b, d} + {c, d}. Similarly for the second component.
> An ideal machine learning model would not care what order training examples appeared in its training process. From a Bayesian perspective, the training dataset is unordered data and all updates based on seeing one additional example should commute with each other.
One of Andrew Gelman's favorite points to make about science 'as practiced' is that researchers fail to behave this way. There's a gigantic bias in favor of whatever information is published first.
Wait a second, they define the induced vector field (and consequently Lie bracket) in terms of batch-size 1 SGD:
> In particular, if x is a training example and L(x) is the per-example loss for the training example x, then this vector field is: v^(x)(θ) = -∇_θ L(x). In other words, for a specific training example, the arrows of the resulting vector field point in the direction that the parameters should be updated.
but for the MXResNet example:
> The optimizer is Adam, with the following parameters: lr = 5e-3, betas = (0.8, 0.999)
This changes the direction of the updates, such that I'm not completely sure the intuitive equivalence holds.
If it were just SGD with momentum, then the measured update directions would be a combination of the momentum vector and v1/v2, so {M + v1, M + v2} = {v1, M} + {M, v2} + {v1, v2}. The Lie bracket is no longer "just" a function of the model parameters and the training examples; it's now inherently path dependent.
For Adam, the parameter-wise normalization by the second norm will also slightly change the directions of the updates in a nonlinear way (thanks to the β2 term).
The interpretation is also strained with fancier optimizers like Muon; this uses both momentum and (approximate) SVD normalization, so I'm really not sure what to expect.
Was hoping for a tournament bracket of best lies found in training data :(
Could this be used for batch filtering?
Lie brackets are bi-linear so whatever you do per example automatically carries over to sums, the bracket for the batch is just the pairwise brackets for elements in the batch, i.e. {a + b + c, d} = {a, d} + {b, d} + {c, d}. Similarly for the second component.
> Similarly for the second component.
Hmm.
{a + b, c + d} = {a, c + d} + {b, c + d} = {a, c} + {a, d} + {b, c} + {b, d}.
{a + b + c, x + y + z} = {a, x + y + z} + {b, x + y + z} + {c, x + y + z} = (a sum of nine direct brackets).
This doesn't look like it will scale well.
Then don't use the Lie bracket. All bilinear forms scale the same way.
> An ideal machine learning model would not care what order training examples appeared in its training process. From a Bayesian perspective, the training dataset is unordered data and all updates based on seeing one additional example should commute with each other.
One of Andrew Gelman's favorite points to make about science 'as practiced' is that researchers fail to behave this way. There's a gigantic bias in favor of whatever information is published first.
Eventually ML folks will discover fiber bundles.
But what bastard "new" name will they give them?
Sooner if you explain why.