https://lifearchitect.ai/models-table/
Love those GPQA scores hovering around 5% when chance (on 4-way multi-choice) would have got them 25%!
or.. A stopped clock is right twice a day; a mis-prompted LLM is wrong 19 times out of 20—but only because we handed it the wrong instruction sheet.
Procedural error in testing perhaps? I'm not familiar with the methodology for GPQA.
So could do better than chance by excluding the option it's picked?
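A back-of-the-envelope check of that idea, as a sketch only. It assumes the roughly 5% figure quoted above is the model's accuracy on 4-way questions, and that we then guess uniformly among the three options the model did not pick. The function name is made up for illustration.

```python
def flipped_accuracy(model_accuracy: float, num_options: int = 4) -> float:
    """Accuracy from guessing uniformly among the options the model did NOT pick."""
    # The model's pick is wrong with probability (1 - model_accuracy); in that
    # case the right answer is one of the remaining (num_options - 1) options.
    return (1.0 - model_accuracy) / (num_options - 1)


print(round(flipped_accuracy(0.05), 3))  # 0.317, better than the 0.25 chance level
print(flipped_accuracy(0.25))            # 0.25, a chance-level model gives no edge
```

Under those assumptions, excluding the model's pick would beat chance whenever the model itself scores below chance.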
A stopped clock is right twice a day, but a running clock set to the wrong time is always wrong.
> a running clock set to the wrong time is always wrong.
Could be right within 15 min accuracy in the appropriate timezone. And such a mechanism can be corrected for in the postprocessing step.
Not always true! Your statement is only true when the running clock's speed is the same as real time. In that case, regular time and the clock's time will never meet.
If the clock is running faster than regular time, it will at some point catch up to regular time and thus be correct for a split second. If the clock is slower than regular time, regular time will catch up to the clock and the clock will be right for a split second.
If the clock is running backwards at very high speed, it would be right infinitely many times but the proportion of the time that it is right would approach some finite constant.
If we are being pedantic, running clocks never run exactly the same as time. So they'll be right (very) much more seldom than the stopped clock, which is right twice a day.
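A rough way to put numbers on this, as a sketch under stated assumptions: a 12-hour dial, a constant clock rate, and "right" meaning the displayed time exactly coincides with the true time. The function name and the `rate` parameter are hypothetical.

```python
def matches_per_day(rate: float, dial_hours: float = 12.0) -> float:
    """Average number of times per 24 real hours that the clock shows the true time.

    rate is clock-hours advanced per real hour:
    1.0 = perfect speed, 0.0 = stopped, negative = running backwards.
    """
    # The displayed error drifts at (rate - 1) dial-hours per real hour, and the
    # clock is right once each time that drift has covered a full dial.
    # Note: rate == 1.0 gives 0 crossings, which is "never right" only if the
    # clock was set wrong to begin with.
    return abs(rate - 1.0) * 24.0 / dial_hours


print(matches_per_day(0.0))   # 2.0  (the stopped clock)
print(matches_per_day(1.0))   # 0.0  (right speed, wrong setting)
print(matches_per_day(-1.0))  # 4.0  (running backwards at normal speed)
print(matches_per_day(1.001)) # a clock gaining 0.1%: far less than once a day
```

So a clock that is only slightly fast or slow is right much less often than a stopped one, while a fast backwards clock is right more often.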
The RMS of wrongness of the running clock is probably lower.
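A quick sketch of that comparison, assuming a 12-hour dial, a stopped clock whose error is uniformly distributed over a window one dial wide (centered on zero), and a running clock that is off by a fixed offset. Both function names are made up for illustration.

```python
import math


def stopped_clock_rms(dial_hours: float = 12.0) -> float:
    """RMS error (hours) of a stopped clock over a full cycle."""
    # The error is uniform on [-w/2, w/2] with w = dial_hours,
    # and the RMS of that distribution is w / sqrt(12).
    return dial_hours / math.sqrt(12.0)


def offset_clock_rms(offset_hours: float) -> float:
    """RMS error (hours) of a clock running at the right speed but set wrong."""
    # A constant error has an RMS equal to its magnitude.
    return abs(offset_hours)


print(round(stopped_clock_rms(), 3))  # 3.464
print(offset_clock_rms(0.25))         # 0.25
```

Under these assumptions the running clock has the lower RMS error whenever it is set wrong by less than about three and a half hours.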
Wonder if the name is inspired by my favorite snack, bamba. The best are the hazelnut bamba.
Btw bamba if given to kids at a young age can drastically reduce the chance of peanut allergies
Let me show you the etymology of Bamba:
SSM (state space model) -> SSSM (structured state space model) -> (it's like a snake ssss...) Mamba -> Bamba
Where does the B come from?
SSM = state-space model, for the unfamiliar.
https://en.wikipedia.org/wiki/State-space_representation
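For anyone who wants to see the basic idea in code, here is a minimal sketch of the textbook discrete linear state-space recurrence, h[t] = A h[t-1] + B u[t], y[t] = C h[t], with a scalar input and output. This is only the generic form, not Mamba's selective variant and not Bamba's actual layers; the function name and the example numbers are made up.

```python
def ssm_scan(A, B, C, inputs):
    """Run h[t] = A h[t-1] + B u[t], y[t] = C h[t] over a scalar input sequence.

    A is an n x n list of lists; B and C are length-n lists.
    """
    n = len(A)
    h = [0.0] * n  # fixed-size state, no matter how long the input is
    outputs = []
    for u in inputs:
        # State update: mix the previous state and add the new input.
        h = [sum(A[i][j] * h[j] for j in range(n)) + B[i] * u for i in range(n)]
        # Readout: project the state down to one output value.
        outputs.append(sum(C[i] * h[i] for i in range(n)))
    return outputs


# One-dimensional "leaky memory": an impulse decays by half each step.
print(ssm_scan([[0.5]], [1.0], [1.0], [1.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25]
```

The point relevant to this thread is that the state `h` stays the same size however long the sequence gets, which is where both the speed-up and the "compressed memory" concern come from.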
> chose to make just about everything associated with Bamba open-source — the training recipes, the data, the data loader IBM designed for largescale distributed training, and a quantization framework aimed at shaving storage and inferencing costs.
This type of architecture is definitely the future. Unlimited attn is a dead end. As a human you don't need to scan an entire book just to guess what the next word will be and LLMs shouldn't need that either.
Humans can re-attend to material whenever necessary (i.e. you can just re-read a book, re-watch a documentary etc. when you feel you have missed crucial context), so it's not the end of the world. These SSMs or modern RNNs can't, and if crucial context has been discarded by the end of the query then, well, too bad. Transformers are of course always re-attending, so it's not an issue for them either. Until that issue is resolved, I don't think attention will be going anywhere.
Not to be contrarian, but if the next word prediction happens to be someone's name or a place or something discussed in multiple places in the book, then often, yes, a knowledge of the full plot of the book is "required" just to predict the next word, as you get to the middle or end of a book.
For example you could never fill in the last chapter of any good book without having knowledge of every previous chapter. Not highly detailed knowledge, but still knowledge.
What an LLM does is stuff it all into short term memory. Humans dump the first pages into long term memory and "make sense" of it. Humans have a massive context window because of this (and sheer brain size and efficiency).
We don’t put things into long-term memory right after we read them. We usually do that after a night of sleep. I personally think that context (and correspondingly the KV cache) in the models is akin to our short-term memory, while the training process (and the actual weights) is akin to our long-term memory. And we can’t be sure our short-term memory doesn’t work by matching the current context against what is currently stored in short-term memory. From this perspective transformers are enough and just fine.
So if you now hide my original comment and try to recall what I said, do you know it word for word (and are you thinking of every word, e.g. did I use one or two spaces somewhere, as that would change the tokens), or do you have a rough concept of what I said?
OTOH if you had to remember a phone number to write it down, how does that differ?
I think in a way it makes transformers superior to humans, their short-term memory is much more powerful =) Supporting extra-long contexts also makes transformers superhuman. Because, again, human short-term memory is exactly this: short term. And much shorter than the millions of tokens we expect from models nowadays.
As for SSMs, I think they compress the model's memory state way too much. Mixed global/local attention layers do just as well. And sparse/block attention seems much more like a way forward (https://arxiv.org/abs/2502.11089).
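To make "global vs. local attention" concrete, here is a small sketch of the two causal masks. It only illustrates the masking pattern (a sliding window of hypothetical size `window`); it is not the scheme from the linked paper, and the function names are made up.

```python
def local_causal_mask(seq_len: int, window: int):
    """mask[i][j] is True when query position i may attend to key position j."""
    # Causal (j <= i) and restricted to the most recent `window` positions.
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def global_causal_mask(seq_len: int):
    """Ordinary causal attention: every earlier position stays visible."""
    return local_causal_mask(seq_len, seq_len)


for row in local_causal_mask(4, 2):
    print("".join("x" if ok else "." for ok in row))
# x...
# xx..
# .xx.
# ..xx
```

A local layer only ever looks at `window` tokens per query, while a global layer looks at all of them, so mixing the two trades memory and compute against the ability to re-attend to distant context.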
Human memory works completely differently. It's not information stored in neurons, or being computed. My theory is that Eternalism ('Block Universe' is the other term) is real and that all memory is accomplished by the fact that your brain remains Quantum Entangled with all past and future "copies" of itself.
You know what "copies" means if you understand Eternalism. Each moment in the past "still" exists, and always exists. Probably about 50% of the best Physicists believe 'crazy' concepts like this (and multiverse, etc), even though it sounds crazy to uneducated people. The only thing that differentiates us is which crazy idea each of us buys into.
I believe all high negentropy systems can interact with their own future/past via the long causality chain going in both directions. This is how non-brain cell based intelligences work too (like Fungi and Plant life). Memory and Consciousness is essentially a retro-causal wave resonance across the entire causality chain of complex systems. We know for sure consciousness is far more correlated to waves than it is to synaptic actions. Neurons and synapses just carry signals. All the memory in consciousness is in the wave domain, as a secondary emergent effect of neuron charge flows.
Never got how mamba models work in multiple dimensions and non-causally.
For some reason this link isn't loading, but it's on https://archive.ph/Ks0xt
Where's the code?
I could find these two resources: Hugging Face: https://huggingface.co/collections/ibm-ai-platform/bamba-674... GitHub: https://github.com/foundation-model-stack/bamba
Another recent transformer/SSM hybrid is "M1", with a more than 3x claimed inference speed-up compared to equivalent transformers: https://arxiv.org/pdf/2504.10449
IBM is claiming at least a 2x inference speed-up with Bamba. Both groups say that future SSM optimizations to vLLM would lead to further inference speed improvement.
the name bamba is killing me lol, all i can see is the snack now
LLM/state space models have been popular for some years now, see: https://arxiv.org/abs/2212.14052
More recently, hybrid architectures that utilize attention plus other operators are gaining traction.
See https://arxiv.org/abs/2503.01868
Dear IBM name pickers: "Bamba", in Italian, means cocaine.
When I read the title 'IBM crossed a transformer with an SSM and got ‘Bamba’' I laughed so hard I woke up my kid
It's just a mamba (https://github.com/state-spaces/mamba) but with a transformer. Idk where the B comes from.
And in Hebrew it's the name of a snack made of peanut-butter-flavored puffed maize https://en.wikipedia.org/wiki/Bamba_(snack)
I imported these to America to feed my infant. Data shows the prevalence of peanut allergies lines up with when AAP guidelines started recommending that babies do NOT eat peanut. Israel never went along with this and thus has the lowest rates of allergies in the world.
You actually don't need to import these yourself. Safeway (is it only a West Coast thing?) usually has these stocked in the kosher section.
I think the difference in allergy rates between UK and Israeli Ashkenazi Jews (10x higher in UK Jews!) [1] is strong evidence for that.
Also, they sell Bamba at Trader Joe’s now.
[1] https://www.jacionline.org/article/S0091-6749(08)01698-9/ful...
The latest research does strongly suggest that introducing small amounts of common allergens (peanuts, shellfish, milk products...) as early as possible significantly reduces the risk of allergies later. Many early childhood organisations already recommend this. Official health recommendations are often slow to catch up (often for good reasons), but introducing peanuts etc. early is already officially recommended in quite a few countries (Australia, NZ, Sweden for example, AFAIK). Not all health professionals are always up to date either, though.
As an Italian who has tried (only) the Israeli Bamba, I can certify that it is pretty addictive.
So someone can get fired for picking IBM after all! Or get a bonus, depending on the organization...
Maybe?
https://en.m.wikipedia.org/wiki/Bamba_(snack)
;)
Or
https://en.wikipedia.org/wiki/La_Bamba_(song)
Or (where I'm from) a school cafeteria:
https://www.thelocal.se/20221125/swedish-word-of-the-day-bam...
Spot on. From the linked blog post "The refrain of La Bamba, the Mexican folk song that Ritchie Valens made famous, goes: Para bailar La Bamba/Se necesita una poca de Gracia. "
A very funny and friendly way to say "cocaine" among Italians. I'm struggling to read it seriously.
and in Portuguese, it means "flimsy". What a great name.
Para bailar La Bamba / Se necesita una poca de gracia
Seems like a good fit.
And in Lithuanian it's a navel
about time they did something to liven things up at big blue
SSMs never stop
i mean that sounds good to me
Yummy