> This post covers one appealing way to constrain the weight matrices of a neural network—by keeping the tensors constrained to submanifolds at each layer. This opens the door to re-thinking optimization, as we can co-design optimization algorithms with these manifold constraints. As an example, we propose a manifold version of the Muon optimizer whose weights are constrained to the Stiefel manifold: the manifold of matrices with unit condition number. We conclude the post by defining the idea of a modular manifold, which is a composable manifold that attempts to make it easier to scale up and train large networks.
Very good presentation. Projected gradient methods were popular during the convex optimization craze two decades ago. The ideas advanced here have precedent and seem sensible to me. My concern is whether it helps much. The test accuracy in figure 6b shows a marginal increase, and a gentler transition to the overfitting regime, suggesting the regularization is working. The higher LR did not translate to a speed up: "Manifold Muon increased the wall clock time per step compared to AdamW..."
More fundamentally, I am a bit skeptical that low test error is the right goal in LLMs because statistical learning theory does not adequately model the macro-behavior of very large models.
> statistical learning theory does not adequately model the macro-behavior of very large models.
Might you please elaborate on this? I recognize that "artificial neural networks are lossy de/compression algorithms" does not enumerate the nuances of these structures, but am curious whether anything in particular is both interesting and missing from SLT.
SLT typically uses empirical risk minimization, leading to the bias-variance decomposition and a unimodal extremum as the monotonically decreasing bias supposedly balances against the monotonically increasing variance. We now know this does not accurately model overparameterized models, which exhibit double descent, and other phenomena like grokking. To explain them you have to look past classical statistics to statistical mechanics.
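For concreteness, the standard squared-error form of the decomposition I mean is:

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Classical SLT reads this as a single U-shaped test-error curve in model capacity; double descent is a second drop past the interpolation threshold that this picture simply does not predict.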
> The test accuracy in figure 6b shows a marginal increase, and a gentler transition to the overfitting regime, suggesting the regularization is working.
Sounds like it might help for online RL training regimes, as those are naturally quite vulnerable to overfitting.
> The test accuracy in figure 6b shows a marginal increase, and a gentler transition to the overfitting regime, suggesting the regularization is working.
Higher LR does not mean there’s overfitting.
Isn't this an old idea? E.g., here's a textbook on optimization algorithms for matrix manifolds https://press.princeton.edu/absil and here's a library that implements this in python for the Stiefel manifold that's the subject of this blog post: https://pymanopt.org/docs/stable/manifolds.html#module-pyman...
What is novel about the approach in the blog post? Serious question, I really can't tell after reading the post.
I don't think it's been tried at scale, with large models.
It remains to be seen if it works better than conventional training schemes.
What's your point? Sometimes things need to be retried. Sometimes there are small subtle details that make or break an idea. So what's the point of acting dismissively? If an old idea that didn't work now works, then what's the problem? Besides, progress is typically iterative, not through leaps and bounds. The vast majority of things that look like leaps are just because we don't see the steps between.
The reason I'm saying this is that the sentiment is often used to pass over working solutions and slows down their progress. So even if unintentional, it should cause us to rethink how we respond. Otherwise we end up with silly claims like "Einstein just used tensors" and "Nash just used topology". In some sense these are accurate, but they are too high-level as descriptions (and these are real dismissals. Which, again, so what? If it works, it works?).
Why is "novelty" even a question? Novelty is only ever in the eyes of the beholder.
> What is novel about the approach in the blog post? Serious question, I really can't tell after reading the post.
Honestly, I do not know, but I'll give you my best read on it.
1) Scale: Don't underestimate the importance of this. While I don't think scale is all you need, it certainly is a critical factor.
2) Different optimization: I may be missing something, but it looks like they are using a different optimizer. They mention that they're using the Muon optimizer constrained to the Stiefel manifold. Neither of those things is unique on its own, but is their combination? This is where I'm uncertain, because such a thing would be easy to miss. Maybe someone did it and was unsuccessful. Maybe someone did, but not at scale. Maybe someone did, it worked, and just nobody noticed (that happens a lot!).
So I think this is quite similar to how 99% of progress and breakthroughs are made: putting together ideas that seem unrelated and inventing some glue to generalize the process. At a high level this always looks like you're just putting existing things together, but that glue is really hard to make. And to continue that analogy, if we do a good enough job gluing things together then to anyone but an expert it'll look like there is no glue. It can be surprisingly difficult to tell if something is glued, welded, mated, milled, printed, or whatever. It usually takes a very keen eye to determine the answer non-destructively.
Apologies if this came across the wrong way. I really do want to know what the novel contributions of the post are, because the author implies that something about what they're doing is solving previously open problems:
> I figured out how to solve manifold Muon in the square case late last year, but I was unable to solve the full rectangular case and thus posed the problem as an open problem on the Modula docs. Jianlin Su solved the problem this summer
It sounds like the generalisation of projected gradient descent to "Muon" is what they're focusing on, but the derivation is all about the retraction map on the Stiefel manifold? I think I'm missing some background here.
I was uncertain, but your other statements made me think that sentiment was unintentional. I just want to push back against it because it is too common and misused, even with good intentions. I hope you don't see this as me saying anything about your character; honestly, my impression is that you do care.
> It sounds like the generalisation of projected gradient descent to "Muon"
I'm not a niche expert here, but I do have knowledge in adjacent/overlapping domains. It sounds like you're in a similar boat? I ask because this pulls back to what I was trying to say about sometimes needing an expert eye.
If it helps, here's the "paper" for the Muon optimizer[0] and here's a follow-up[1]. Muon is definitely a gradient descent technique, but so are Adam, SGD, Ada, and many more[2].
The big thing in Muon is using Newton-Schulz orthogonalization (NS_5, a quintic Newton-Schulz iteration). So you update parameters with θ_{t-1} - η[NS_5(μB_{t-1} + ∇L(θ_{t-1}))] (I bracketed it so you can see that this is just a specific instance of θ_{t-1} - ηF(∇L(θ_{t-1}), ...), a class that standard gradient descent -- θ - η∇L(θ) -- also belongs to, right?). So we should be careful not to over-generalize and say that this is just gradient descent. You could even say [1] is "just [0] but with weight decay" (or go look at the Adam and AdamW algos ;)
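To make that concrete, here's a minimal PyTorch sketch of a Muon-style step (the function names are my own; the Newton-Schulz coefficients are the ones quoted in [0], but treat this as an illustration rather than the reference implementation):

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1)
    using the quintic Newton-Schulz iteration described in the Muon write-up [0]."""
    a, b, c = 3.4445, -4.7750, 2.0315   # coefficients from [0]
    X = G / (G.norm() + eps)            # scale so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:                      # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W: torch.Tensor, grad: torch.Tensor, buf: torch.Tensor,
              lr: float = 0.02, mu: float = 0.95) -> None:
    """One Muon-style update: theta <- theta - lr * NS_5(mu * B + grad)."""
    buf.mul_(mu).add_(grad)             # B_t = mu * B_{t-1} + grad
    W.sub_(lr * newton_schulz5(buf))    # orthogonalized step
```

The real implementation in [0] has more going on (it runs Newton-Schulz in bfloat16, supports Nesterov momentum, etc.); this just shows the skeleton of "momentum buffer, then orthogonalize the update."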
But one thing I should add is that gradient descent algorithms aren't geometry-aware. I was able to find this post[3], which asks a related question: what conditions make a surface's geodesic align with gradient descent (note that Newton's method differs from GD too)? I don't think the blog post is creating a solution where the GD formulation results in following a geodesic to the minimum, but my take is that it is working in that direction. And to clarify, we'd want to follow the geodesic because that gives us the shortest or most energy-efficient path (whichever perspective you want to use). In optimization we want to accomplish these two things (and more!): 1) take the "best" path to the optimum, 2) find the best optimum. Unfortunately these are ill-defined and there aren't always objective answers to them. But in an ideal gradient descent algorithm we'd want it to go to the global minimum and take the fastest path, right? So with that, it helps to be aware of the geometry (part of why people look at the Hessian, though that comes at the cost of increased computation, even if the additional information can get us there in fewer steps -- so it's not (always) "the best").
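For a concrete picture of what a geometry-aware step could look like, here's a rough sketch of one Riemannian gradient step on the Stiefel manifold -- my own illustration with my own function names, not the manifold Muon update from the post: project the raw gradient onto the tangent space at W, take the step, then retract back onto the manifold.

```python
import torch

def stiefel_tangent_project(W: torch.Tensor, G: torch.Tensor) -> torch.Tensor:
    """Project a Euclidean gradient G onto the tangent space of the Stiefel
    manifold at W (columns of W orthonormal): remove the symmetric part of W^T G."""
    WtG = W.T @ G
    return G - W @ (WtG + WtG.T) / 2

def stiefel_retract(X: torch.Tensor) -> torch.Tensor:
    """Map a matrix back onto the Stiefel manifold via its polar factor:
    keep the singular vectors, set every singular value to 1."""
    U, _, Vh = torch.linalg.svd(X, full_matrices=False)
    return U @ Vh

def riemannian_sgd_step(W: torch.Tensor, G: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    """One geometry-aware step: tangent projection, Euclidean step, retraction."""
    return stiefel_retract(W - lr * stiefel_tangent_project(W, G))
```

Starting from a W with orthonormal columns (e.g. the Q factor from torch.linalg.qr), repeated steps stay on the manifold up to floating-point error.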
I know this isn't a full answer, and maybe with more reading I'll have a better one for you. But I'm hoping my answer can at least help you see some of the underlying nuanced problems that (_I think_) the authors are trying to get at. Hopefully I'm not too far off base lol. I'm hoping someone with more expertise can jump in and provide corrections/clarifications in the meantime.
[0] https://kellerjordan.github.io/posts/muon/
[1] https://arxiv.org/abs/2502.16982
[2] (far from a complete list) https://docs.pytorch.org/docs/stable/optim.html#algorithms
[3] (I think similar types of questions may also be fruitful) https://mathoverflow.net/questions/42617/functions-whose-gra...
Is the original Thinking Machines trademark[0] no longer active? They were the original AI company, back when AI was a completely different thing than it is today.
[0] https://en.m.wikipedia.org/wiki/Thinking_Machines_Corporatio...
That company has been defunct since 1994, 31 years ago.
Not here to comment on the _content_ of the blog post...
Just wanted to say the blog post design looks super nice. Beautifully laid out, very readable typography, clear graphics, approachable design with a welcoming UX, footnotes in the side, etc.
Anybody know how this is designed / styled? (I can see three.js being used, along with katex.js - but don't know more details)
Thanks
UX on the other hand...I hate it when sites hijack my key commands for moving backwards and forwards in my browser history. Please don't do this!
For me it's horrible: some scripts make the scrolling very choppy, unusable... I had to disable scripts just to be able to scroll normally :-(
I think the diagrams look very similar to what Keenan Crane uses in his papers, perhaps they used that tool. I think his students have now fleshed it out for general use.
Interesting. Modular manifolds are precisely what hypertokens use for prompt compiling.
Specifically, we linearize the emergent KVQ operations of an arbitrary prompt in any arbitrary model by way of interleaving error-correcting code (ECC).
ECC tokens are out-of-band tokens, e.g., from Unicode's Private Use Area (PUA), interleaved with raw context tokens. This construction induces an in-context associative memory.
Any sort of interleaved labeling basis, e.g., A1, quick brown fox, A2, jumped lazy dog, induces a similar effect for chaining recall & reasoning more reliably.
This trick works because PUA tokens are generally untrained, hence their initial embedding is still random Gaussian w.h.p. Similar effects can be achieved by simply using token combos unlikely to exist, and in practice these are often more effective, since PUA tokens like emojis or Mandarin characters are often 2, 3, or 4 tokens after tokenization vs. codeword combos like zy-qu-qwerty every k content tokens, where k can be variable.
Building attention architectures using modular manifolds in white / gray-box models, as this new work shows, vs. prompt-based black-box injection is a natural next step, and so we can at least anecdotally validate what they're building ahead of their next paper or two.
Which is all to say, absolutely great to see others building in this way!
Wot? Is this what AI generated non-sense has come to? This is totally unrelated.
Nope. The construction induces ECC-driven emergent modular manifolds in latent space during the KVQ maths. You can't use any ole ECC -- that's the crux of why it works. More in another reply.
The original article discusses techniques for constraining the weights of a neural network to a submanifold of weight space during training. Your comment discusses interleaving the tokens of an LLM prompt with Unicode PUA code points. These are two almost completely unrelated things, so it is very confusing to me that you are confidently asserting that they are the same thing. Can you please elaborate on why you think there is any connection at all between your comment and the original article?
Our ECC construction induces an emergent modular manifold during KVQ computation.
Suppose we use 3 codeword lanes for every codeword, which is our default. Each lane of tokens is based on some prime p, so collectively they form a CRT-driven codeword (Chinese Remainder Theorem). This is discretely equivalent to labeling every k tokens with a globally unique indexing grammar.
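To unpack the CRT part mechanically -- this is only my illustrative reading of the claim, with hypothetical lane moduli and names, not the commenter's actual construction:

```python
# Label position i by its residues modulo pairwise-coprime primes. By the
# Chinese Remainder Theorem the residue triple is unique for i < 3*5*7 = 105,
# so it acts as a globally unique index for every position in that range.
primes = (3, 5, 7)  # hypothetical moduli, one per "lane"

def crt_label(i: int) -> tuple[int, ...]:
    return tuple(i % p for p in primes)

labels = [crt_label(i) for i in range(3 * 5 * 7)]
assert len(set(labels)) == len(labels)  # all 105 labels are distinct
```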
That interleaving also corresponds to a triple of adjacent orthogonal embeddings, since those tokens still retain a random Gaussian embedding. The net effect is that we similarly slice the latent space into a spaced chain of modular manifolds every k content tokens.
We also refer to that interleaving as Stiefel frames, for similar reasons as the post reads, etc. We began work this spring or so to inject that net construction inside the model, with early results in a similar direction as the post describes. That's another way of saying this sort of approach lets us make that chained atlas (wc?) of modular manifolds as tight as possible within the dimensional limits of the embedding, floating point precision, etc.
We somewhat tongue-in-cheek refer to this as the retokenization group at the prompt level re: renormalization group / tensor nets / etc. Relayering group is the same net intuition or perhaps reconnection group at architecture level.
I'm sorry, but even if I am maximally charitable and assume that everything you are saying is meaningful and makes sense, it still has essentially nothing to do with the original article. The original article is about imposing constraints on the weights of a neural network, during training, so that they lie on a particular manifold inside the overall weight space. The "modular" part is about being able to specify these constraints separately for individual layers or modules of a network and then compose them together into a meaningful constraint for the global network.
You are talking about latent space during inference, not weight space during training, and you are talking about interleaving tokens with random Gaussian tokens, not constraining values to lie on a manifold within a larger space. Whether or not the thing you are describing is meaningful or useful, it is basically unrelated to the original article, and you are not using the term "modular manifold" to refer to the same thing.
Hmm, I hear you. My point wasn't that we are applying modular manifolds in the same way; it was that we are working on model reliability from two extremal ends using the same principle. There are various ways to induce modular manifolds in a model at various levels of resolution / power. We started at the outside-working-in level, so it works with any black-box model out of the box with zero knowledge needed; you don't even need to know the token dictionary to show the effect.
We're already working on pushing the construction deeper into the model, both architecture and training. Currently that's for fine-tuning, and ultimately full architecture shrinkage / pruning and raw training vs. just fine-tuning, etc.
And it was just great to see someone else using modular manifolds, even if they are using them at the training stage vs. the inference stage. They're exploiting modular form at training; we're doing it at inference. Cool to see.
The learning rates they demonstrate are crazy - though the standard when talking about CIFAR-10 is 94% accuracy iirc. Showing ~60% accuracy is weird.
Has DAWNBench been done with manifold Muon (with a more appropriate architecture)?
They say they train for ~3 epochs. Could it be that's just not long enough of a training run? I have no idea how many epochs are usually used in those models.
Um.. the model is tiny: https://github.com/thinking-machines-lab/manifolds/blob/main...
Yeah, it's just the wrong architecture for the job, so I found it to be a strange example.
Here's the top model on DAWNBench - https://github.com/apple/ml-cifar-10-faster/blob/main/fast_c...
Trains for 15 epochs and, like all the others, is a 9-layer ResNet.
Usually there's more to an ML / data-science idea (that's not a fully fleshed-out journal paper) than beating a SOTA benchmark.
In fact beating SOTA is often the least interesting part of an interesting paper and the SOTA-blind reviewers often use it as a gatekeeping device.
Sure, of course. I wasn't suggesting "are you beating a SOTA benchmark?" I'm floating the idea of an ablation that matches a realistic scenario for the dataset / task. Personally curious how manifold Muon performs compared to AdamW in a thoroughly explored context. This is the first time I've seen a 3-layer MLP on CIFAR-10.
I probably should have made the 9-layer ResNet part more front-and-center / central to my point.
Got you, this time.
It's a 3-layer MLP, as stated in the article.
https://archive.is/bP3BG
If you like to scroll on mobile :)
Reminiscing about an old HN comment arguing that differential geometry was irrelevant to machine learning with a smile on my face.
Happy to see this opinion expressed here, too. The more math skeptics there are out there, the longer I get to keep my job. :)
"I have never had to do integrate the "arctan" function by hand in my entire career" arguments are not worth engaging with.
If people are happy with a job or a role that does not need math that' fine.
Familiarity with Maths let's you to rise to the occasion, to become more than a replaceable cog.
The thing is, unless you are trained in math you wouldn't even recognise the opportunity, that a certain kind Of Math could have been used here. In fact, even if you are trained in Math you may not see it till much later -- it needs a special eye and something in that moment.
Polyhedrons were looked at for centuries after centuries by top-notch mathematicians. All missed Euler's formula, except perhaps Descartes.
Often what happens is some nontrivial branch of mathematics suddenly finds a novel and impactful application. Then crowds jump in to learn that Math. But it's mostly already a little too late for them, they have missed this bus.
The best case is when you already know the math beforehand, even though you don't know which part will be handy. It helps if you love the subject and can afford to invest time to learn it for the love of the subject. Once in a while you happen to find yourself in the right place at the right time, with the right tools you need.
> Often what happens is some nontrivial branch of mathematics suddenly finds a novel and impactful application. Then crowds jump in to learn that Math. But it's mostly already a little too late for them, they have missed this bus.
However, in the meantime, the experts in that math have "missed the bus" on the application area, which they don't know enough about because they were studying math instead.
The world is full of useful shapes! No reason that math shouldn't :)
Nice! Posts like this make me remorseful about not following a mathematics career. I'm sure some of the notation is basic (as in undergrad), but I'd need an entire weekend to understand the post.
Curious why the authors chose the blog format over a research report?
thinkingmachines likes to flex
you mean a paper? because it's not paper quality content?
Exactly, it’s like they’re targeting people who don’t really know much about ML but are easily wowed by fancy math jargon and nice drawings.
This is exactly the kind of out-of-the-box thinking that will get us past some of the limitations of the current crop of AI architectures. Bravo to the authors.
So their way to differentiate from frontier labs is to try writing research blog posts (not papers). It will be interesting to see how this plays out. I don't think that anyone serious about developing frontier models would be putting anything useful out there for others. We already see this with all the incumbents -- Google, OAI, Anthropic, xAI, DeepSeek, and the other Chinese labs.
Because it’s not research quality. The only people excited by this are people who don’t know anything about actual ML, and think this is amazing.
Why is it not research quality? What’s missing?
Well-done post, I'd like to read more of their work and it's exciting to see these new ideas. Though as other people have said, the one set of empirical results that they present is a bit... confusing? I'd think they'd have some more compelling examples to present given all the pretty math.
Their modular norm paper (https://arxiv.org/abs/2405.14813) has several more examples; see their appendix D in particular, but these are also mystifying. Yes they're interested in how things scale but am I the only one to whom it seems that the training losses they report are just not competitive with things that are currently being used?
What does this mean?
TL;DR: The OP notes that we currently use all sorts of tricks of the trade, including applying normalization layers, to keep unit values in DNNs from getting too large or too small when we train them. Keeping unit values from getting too large or small prevents numerical underflow/overflow, and also helps speed up learning by keeping the magnitudes of updates small in relation to weights. The OP proposes that we should constrain weights to be in sub-manifolds with unit condition number[a] at each layer, and that we should modify/design SGD algorithms to work well within those manifolds.
I find the idea compelling, but it's too early to know if it will work well at scale, you know, with large models, in the real world.
--
[a] https://en.wikipedia.org/wiki/Condition_number
--
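To make the "unit condition number" part above concrete, here's a tiny illustration of my own (not code from the post; the names are mine): a matrix with orthonormal columns has all singular values equal to 1, so its condition number is exactly 1, and any full-rank matrix can be pushed onto that manifold by keeping only its singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 64))              # a generic weight matrix

s = np.linalg.svd(W, compute_uv=False)
print(s.max() / s.min())                    # condition number, generally >> 1

U, _, Vt = np.linalg.svd(W, full_matrices=False)
W_stiefel = U @ Vt                          # keep singular vectors, drop singular values
s_proj = np.linalg.svd(W_stiefel, compute_uv=False)
print(s_proj.max() / s_proj.min())          # ~1.0: unit condition number
```

As I read the post, manifold Muon does something more principled than a naive project-after-step, but unit condition number is the property being targeted.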
EDIT: On the other hand, yesterday I saw a paper about doing basically the opposite, letting unit values in DNNs get as big or small as they need to get... by mapping them to complex logarithms and keeping them in that domain: https://openreview.net/forum?id=SUuzb0SOGu . I also found this opposing idea oddly compelling, but I don't know how well it works either, because it hasn't been tested at scale.
Hmmm… http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Doesn't apply as long as the improvements obtained there scale with compute.
Now, are there actual meaningful improvements to obtain, and do they stick around all the way to frontier runs? Unclear, really. So far, it looks like opening a can of hyperparameters.
this is a bad example to claim the bitter lesson applies to; it's about the fundamentals of optimization techniques, not about tying the solution space to hand-crafted things.
Aren’t they all optimization techniques at the end of the day? Now you’re just debating semantics
believe what you want, i guess