Replacing transformers is difficult because of how well matched they are to the hardware we have. They were partially designed to solve the mismatch between RNNs and GPUs, and they are way too good at it. If you come up with something truly new, it's quite likely you have to influence hardware makers to help scale your idea. That makes any new idea fundamentally coupled to hardware, and that's the lesson we should be taking from this: work on the idea as a simultaneous synthesis of hardware and software. But it also means that fundamental change is measured in decades.
I get the impulse to do something new, to be radically different and stand out, especially when everyone is obsessing over it, but we are going to be stuck with transformers for a while.
This is backwards. Algorithms that can be parallelized are inherently superior, independent of the hardware. GPUs were built to take advantage of that superiority and to handle all kinds of parallel algorithms well - graphics, scientific simulation, signal processing, some financial calculations, and on and on.
There’s a reason so much engineering effort has gone into speculative execution, pipelining, multicore design etc - parallelism is universally good. Even when “computers” were human calculators, work was divided into independent chunks that could be done simultaneously. The efficiency comes from the math itself, not from the hardware it happens to run on.
RNNs are not parallelizable by nature. Each step depends on the output of the previous one. Transformers removed that sequential bottleneck.
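To make that contrast concrete, here is a minimal numpy sketch (the sizes and random weights are arbitrary, purely for illustration): the RNN's hidden states have to be computed one step at a time, while the self-attention block is nothing but dense matrix multiplications over the whole sequence at once.

    import numpy as np

    T, d = 128, 64                        # sequence length, model width (arbitrary)
    x = np.random.randn(T, d)             # one toy input sequence

    # RNN: each hidden state depends on the previous one, so the time
    # dimension must be processed step by step.
    W_h = np.random.randn(d, d) * 0.01
    W_x = np.random.randn(d, d) * 0.01
    h, states = np.zeros(d), []
    for t in range(T):                    # inherently sequential loop
        h = np.tanh(W_h @ h + W_x @ x[t])
        states.append(h)

    # Single-head self-attention: every position attends to every other one
    # via dense matmuls, so the whole sequence is processed in one shot.
    W_q, W_k, W_v = (np.random.randn(d, d) * 0.01 for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(d)         # (T, T) similarity matrix, one matmul
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    out = weights @ V                     # (T, d), no step-by-step dependency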
Haha, I like to joke that we were on track for the singularity in 2024, but it stalled because the research time gap between "profitable" and "recursive self-improvement" was just a bit too long, so now we're stranded on the transformer model for the next two decades, until every last cent has been extracted from it.
There's a massive hardware and energy infrastructure build-out going on. None of it is specialized to run only transformers at this point, so wouldn't that create a huge incentive to find newer and better architectures to get the most out of all this hardware and energy infra?
Hardware that can only run transformers is a silly concept, because attention consists of two matrix multiplications (plus a softmax), and matrix multiplication is already the standard operation in feed-forward and convolutional layers. Basically, any hardware built for those gets transformers for free.
> Now, as CTO and co-founder of Tokyo-based Sakana AI, Jones is explicitly abandoning his own creation. "I personally made a decision in the beginning of this year that I'm going to drastically reduce the amount of time that I spend on transformers," he said. "I'm explicitly now exploring and looking for the next big thing."
So this is really just BS hype talk, an attempt to attract more funding and VCs.
Why wouldn't this both be an attempt to get funding and also him wanting to do something new? Certainly if he was wanting to do something new he'd want it funded, too?
He sounds a lot like how some people behave when they reach a "top". Suddenly that thing seems unworthy all of a sudden. It's one of the reasons you'll see your favorite music artist totally go a different direction on their next album. It's an artistic process almost. There's a core arrogance involved, that you were responsible for the outcome and can easily create another great outcome.
Many researchers who invent something new and powerful pivot quickly to the next thing. That's because they're researchers, and the incentive is to develop new things that subsume the old ones. Other researchers will continue to work on improving existing things and finding new applications for them, but they rarely get as much attention as the folks who "discover" something new.
Why "arrogance"? There are music artists that truly enjoy making music and don't just see their purpose in maximizing financial success and fan service?
There are other considerations that don't revolve around money, but I feel it's arrogant to assume success is the only motivation for musicians.
Sans money, it's arrogant because we know talent is god-given. You are basically betting again that your natural given trajectory has more leg room for more incredible output. It's not a bad bet at all, but it is a bet. Some talent is so incredible that it takes a while for the ego to accept its limits. Jordan tried to come back at 40 and Einstein fought quantum mechanics unto death. Accepting the limits has nothing to do with mediocrity, and everything to do with humility. You can still have an incredible trajectory beyond belief (which I believe this person has and will have).
Einstein also got his Nobel Prize for basically discovering quanta. I'm not sure he fought quantum mechanics so much as tried to figure out what's going on with it, which is still kind of unknown.
Or a core fear, that you'll never do something as good in the same vein as the smash hit you already made, so you strike off in a completely different direction.
When you're overpressured to succeed, it makes a lot of sense to switch up your creative process in hopes of getting something new or better.
It doesn't mean that you'll get good results by abandoning prior art, either with LLMs or musicians. But it does signal a sort of personal stress and insecurity, for sure.
It's a good process (although, many take it to its common conclusion which is self-destruction). It's why the most creative people are able to re-invent themselves. But one must go into everything with both eyes open, and truly humble themselves with the possibility that that may have been the greatest achievement of their life, never to be matched again.
I wonder if he can simply sit back and bask in the glory of being one of the most important people during the infancy of AI. Someone needs to interview this guy, would love to see how he thinks.
What "AI" means for most people is the software product they see, but only a part of it is the underlying machine learning model. Each foundation model receives additional training from thousands of humans, often very lowly paid, and then many prompts are used to fine-tune it all. It's 90% product development, not ML research.
If you look at AI research papers, most of them are by people trying to earn a PhD so they can get a high-paying job. They demonstrate an ability to understand the current generation of AI and tweak it; they create content for their CVs.
There is actual research going on, but it's a tiny share of everything, and it does not look impressive because it's not a product or a demo, but an experiment.
I have a feeling there is more research being done on non-transformer based architectures now, not less. The tsunami of money pouring in to make the next chatbot powered CRM doesn’t care about that though, so it might seem to be less.
I would also just fundamentally disagree with the assertion that a new architecture will be the solution. We need better methods to extract more value from the data that already exists. Ilya Sutskever talked about this recently. You shouldn’t need the whole internet to get to a decent baseline. And that new method may or may not use a transformer, I don’t think that is the problem.
I think you misunderstood the article a bit by saying that the assertion is "that a new architecture will be the solution". That's not the assertion. It's simply a statement about the lack of balance between exploration and exploitation. And the desire to rebalance it. What's wrong with that?
It looks like almost every AI researcher and lab who existed pre-2017 is now focused on transformers somehow. I agree the total number of researchers has increased, but I suspect the ratio has moved faster, so there are now fewer total non-transformer researchers.
Well, we also still use wheels despite them being invented thousands of years ago. We have added tons of improvements on top though, just as transformers have. The fact that wheels perform poorly in mud doesn’t mean you throw out the concept of wheels. You add treads to grip the ground better.
If you check the DeepSeek-OCR paper, it shows text-based tokenization may be suboptimal. Also all of the MoE stuff, reasoning, and RLHF. The 2017 paper is pretty primitive compared to what we have now.
I think it may become a roadblock quite soon. If you look at all the data centers planned and the speed of it, it's going to be a job getting the energy. xAI hacked it by putting about 20 gas turbines around their data center, which is giving locals health problems from the pollution. I imagine that sort of thing will be cracked down on.
If there's a legit long-term demand for energy, the market will figure it out. I doubt that will be a long-term issue; it's just a short-term one because of the gold rush. But innovation doesn't have to happen overnight. The world doesn't live or die on a subset of VC funds not 100xing within a certain timeframe.
Or it's possible China just builds the power capacity faster, because they actually build new things.
I think people care too much about trying to innovate a new model architecture. Models are meant to create a compressed representation of their training data. Even if you came up with a more efficient compression, the capabilities of the model wouldn't be any better. What is more relevant is finding more efficient ways of training, like the shift to reinforcement learning these days.
But isn't the maximum training efficiency naturally tied to the architecture? Meaning other architectures have a different training-efficiency landscape? I've said it somewhere else: it is not about "caring too much about new model architectures" but about having a balance between exploitation and exploration.
I ask myself how much this industry's focus on transformer models is informed by the ease of computation on GPUs/NPUs, and whether better AI technology is possible but would require much greater computing power on traditional hardware architectures. We depend so much on traditional computation architectures that it might be a real blind spot. My brain doesn't need 500 watts, at least I hope so.
>The project, he said, was "very organic, bottom up," born from "talking over lunch or scrawling randomly on the whiteboard in the office."
Many breakthrough, game-changing inventions were done this way, with back-of-the-envelope discussions; another popular example is Ethernet.
Some good stories of a similar culture at AT&T Bell Labs are well described in Hamming's book [1].
[1] Richard Hamming, The Art of Doing Science and Engineering (Stripe Press)
All transformative inventions and innovations seem to come from similar scenarios, like "I was playing around with these things" or "I just met X at lunch and we discussed ...".
I'm wondering how big an impact work from home will really have on humanity in general, when so many of our life-changing discoveries come from the odd chance of two specific people happening to be in the same place at some moment in time.
What you say is true, but let’s not forget that Ken Thompson did the first version of unix in 3 weeks while his wife had gone to California with their child to visit relatives, so deep focus is important too.
It seems, in those days, people at Bell Labs did get the best of both worlds: being able to have chance encounters with very smart people while also being able to just be gone for weeks to work undistracted.
A dream job that probably didn’t even feel like a job (at least that’s the impression I get from hearing Thompson talk about that time).
I'd go back to the office in a heartbeat, provided it was an actual office - not an "open-office" layout, where people are forced to try to concentrate amid constant noise and people passing behind them.
The agile treadmill (with PMs breathing down our necks) and features getting planned and delivered in two-week sprints have also reduced our ability to just do something we feel needs doing. Today you go to work to feed several layers of incompetent managers - there is no room for play, or for creativity. At least in most orgs I know.
I think innovation (or even the joy of being at work) needs more than just an office, or people, or a canteen; it needs an environment that supports it.
Personally, I try to under-promise on what I think I can do every sprint specifically so I can spend more time mentoring more junior engineers, brainstorming random ideas, and working on stuff that nobody has called out as something that needs working on yet.
Basically, I set aside as much time as I can to squeeze in creativity and real engineering work into the job. Otherwise I'd go crazy from the grind of just cranking out deliverables
We have an open office surrounded by "breakout offices". I simply squat in one of the offices (I take most meetings over video chat), as do most of the other principals. I don't think I could do my job in an office if I couldn't have a room to work in most of the time.
As for agile: I've made it clear to my PMs that I generally plan on a quarterly/half year basis and my work and other people's work adheres to that schedule, not weekly sprints (we stay up to date in a slack channel, no standups)
And it has always felt to me that it has lineage from the neural Turing machine line of work as a prior. The transformative part was: 1. find a good task (machine translation) and a reasonable way to stack (the encoder-decoder architecture); 2. run the experiment; 3. ditch the external KV store idea and just use self-projected KV.
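To make point 3 concrete, here is a rough numpy sketch of the distinction (my own illustration of the general idea, not a formulation from either paper): an NTM-style design attends over a separate external memory, whereas transformer self-attention projects its keys and values from the input sequence itself.

    import numpy as np

    T, d, M = 16, 32, 8                 # sequence length, width, external-memory slots
    x = np.random.randn(T, d)           # input sequence
    W_q, W_k, W_v = (np.random.randn(d, d) * 0.1 for _ in range(3))

    def attend(Q, K, V):
        s = Q @ K.T / np.sqrt(K.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        return w @ V

    # NTM-flavoured: keys and values live in a separate, learned external memory.
    mem_K, mem_V = np.random.randn(M, d), np.random.randn(M, d)
    read_external = attend(x @ W_q, mem_K, mem_V)    # (T, d)

    # Transformer-flavoured: keys and values are projected from the sequence itself.
    read_self = attend(x @ W_q, x @ W_k, x @ W_v)    # (T, d)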
"""Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout,[11] and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.[9]"""
Something which I haven't been able to fully parse, and that perhaps someone has better insight into: aren't transformers inherently only capable of inductive reasoning? In order to actually progress to AGI, which is being promised at least as an eventuality, don't models have to be capable of deduction? Wouldn't that mean fundamentally changing the pipeline in some way? And no, tools are not deduction. They are useful patches for the lack of deduction.
Models need to move beyond the domain of parsing existing information into existing ideas.
I don't see any reason to think that transformers are not capable of deductive reasoning. Stochasticity doesn't rule out that ability. It just means the model might be wrong in its deduction, just like humans are sometimes wrong.
That sounds like a category mistake to me. A proof assistant or logic-programming system performs deduction, and just strapping one of those to an LLM hasn't gotten us to "AGI".
The other big missing part here is the enormous incentives (and punishments if you don't) to publish in the big three AI conferences. And because quantity is being rewarded far more than quality, the meta is to do really shoddy and uninspired work really quickly. The people I talk to have a 3-month time horizon on their projects.
My opinion on the "Attention is all you need" paper is that its most important idea is the Positional Encoding. The transformer head itself... is just another NN block among many.
tl;dr: AI is built on top of science done by people just "doing research", and transformers took off so hard that those same people now can't do any meaningful, real AI research anymore because everyone only wants to pay for "how to make this one single thing that everyone else is also doing, better" instead of being willing to fund research into literally anything else.
It's like if someone invented the hamburger and every single food outlet decided to only serve hamburgers from that point on, only spending time and money on making the perfect hamburger, rather than spending time and effort on making great meals. Which sounds ludicrously far-fetched, but is exactly what happened here.
Good points, and it made me have a mini epiphany...
I think you analogously just described Sun Microsystems, where Unixes (BSD originally in their case, generalized to an SVR4 (?) hybrid later) worked so well that NT was built as a hybridization for the Microsoft user base, Apple reabsorbed the BSD-Mach-DisplayPostScript hybridization spinoff NeXT, and Linux simultaneously thrived.
These are evolutionary dead ends; sorry that I'm not inspired enough to see it any other way. This transformer-based direction is good enough.
The LLM stack has enough branches of evolution within it for efficiency; agent-based work on its own can power a new industrial revolution specifically around white-collar workers, while expanding self-expression and personal fulfillment for everyone else.
The way I look at transformers is: they have been one of the most fertile inventions in recent history. Originally released in 2017, in the subsequent 8 years they completely transformed (heh) multiple fields, and at least partially led to one Nobel prize.
Realistically, I think the valuable idea is probabilistic graphical models (of which transformers are an example): combining probability with sequences, or with trees and graphs, is likely to continue to be a valuable area for research exploration for the foreseeable future.
I'm skeptical that we'll see a big breakthrough in the architecture itself. As sick as we all are of transformers, they are really good universal approximators. You can get some marginal gains, but how much more _universal_ are you realistically going to get? I could be wrong, and I'm glad there are researchers out there looking at alternatives like graphical models, but for my money we need to look further afield: reconsider the auto-regressive task, the cross-entropy loss, even gradient descent optimization itself.
There are many many problems with attention.
The softmax has issues regarding attention sinks [1]. The softmax also causes sharpness problems [2]. In general, the decision boundary being Euclidean dot products isn't actually optimal for everything; there are many classes of problem where you want polyhedral cones [3]. Positional embeddings are also janky af, and so is RoPE tbh; I think Cannon layers are a more promising alternative for horizontal alignment [4].
I still think there is plenty of room to improve these things. But a lot of focus right now is unfortunately being spent on benchmaxxing using flawed benchmarks that can be hacked with memorization. I think a really promising and underappreciated direction is synthetically coming up with tasks that current architectures mathematically cannot do well, and proving that they struggle with them. A great example of this is the "ViTs need glasses" paper [5], or belief state transformers with their star task [6]. The Google one about the limits of embedding dimensions is also great, and shows how the dimension of the QK part is actually important to getting good retrieval [7].
[1] https://arxiv.org/abs/2309.17453
[2] https://arxiv.org/abs/2410.01104
[3] https://arxiv.org/abs/2505.17190
[4] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5240330
[5] https://arxiv.org/abs/2406.04267
[6] https://arxiv.org/abs/2410.23506
[7] https://arxiv.org/abs/2508.21038
If all your problems with attention are actually just problems with softmax, then that's an easy fix. Delete softmax lmao.
No but seriously, just fix the fucking softmax. Add a dedicated "parking spot" like GPT-OSS does and eat the gradient flow tax on that, or replace softmax with any of the almost-softmax-but-not-really candidates. Plenty of options there.
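For what it's worth, here is a hedged numpy sketch of the general "parking spot" idea (one extra learnable sink logit per attention row - the broad mechanism, not necessarily GPT-OSS's exact implementation): the sink column can absorb probability mass, so a query that matches nothing isn't forced to smear weight over real tokens.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def attention_with_sink(Q, K, V, sink_logit=0.0):
        # Standard softmax attention plus one extra "parking spot" column.
        # sink_logit would be a learned per-head parameter in practice.
        T, d = Q.shape
        scores = Q @ K.T / np.sqrt(d)                              # (T, T)
        sink = np.full((T, 1), sink_logit)                         # extra column
        probs = softmax(np.concatenate([scores, sink], axis=-1))   # (T, T + 1)
        return probs[:, :-1] @ V                                   # drop the sink

    # Toy usage: attention rows now sum to <= 1, with the rest parked on the sink.
    rng = np.random.default_rng(0)
    Q = K = V = rng.standard_normal((4, 8))
    out = attention_with_sink(Q, K, V, sink_logit=2.0)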
The reason why we're "benchmaxxing" is that benchmarks are the metrics we have, and the only way by which we can sift through this gajillion of "revolutionary new architecture ideas" and get at the ones that show any promise at all. Of which there are very few, and fewer still that are worth their gains when you account for: there not being an unlimited amount of compute. Especially not when it comes to frontier training runs.
Memorization vs generalization is a well known idiot trap, and we are all stupid dumb fucks in the face of applied ML. Still, some benchmarks are harder to game than others (guess how we found that out), and there's power in that.
I think something with more uniform training and inference setups, and otherwise equally hardware friendly, just as easily trainable, and equally expressive could replace transformers.
BDH
Yeah that thing is quite interesting - baby dragon hatchling https://news.ycombinator.com/item?id=45668408 https://youtu.be/mfV44-mtg7c
Which fields have they completely transformed? How was it before and how is it now? I won't pretend like it hasn't impacted my field, but I would say the impact is almost entirely negative.
Everyone who did NLP research or product discovery in the past 5 years had to pivot real hard to salvage their shit post-transformers. They're very disruptively good at most NLP tasks.
edit: post-transformers meaning "in the era after transformers were widely adopted" not some mystical new wave of hypothetical tech to disrupt transformers themselves.
Sorry but you didn't really answer the question. The original claim was that transformers changed a whole bunch of fields, and you listed literally the one thing language models are directly useful for.. modeling language.
I think this might be the ONLY example that doesn't back up the original claim, because of course an advancement in language processing is an advancement in language processing -- that's tautological! every new technology is an advancement in its domain; what's claimed to be special about transformers is that they are allegedly disruptive OUTSIDE of NLP. "Which fields have been transformed?" means ASIDE FROM language processing.
other than disrupting users by forcing "AI" features they don't want on them... what examples of transformers being revolutionary exist outside of NLP?
Claude Code? lol
I think they meant fields of research. If you do anything in NLP, CV, inverse-problem solving or simulations, things have changed drastically.
Some directly, because LLMs and highly capable general purpose classifiers that might be enough for your use case are just out there, and some because of downstream effects, like GPU-compute being far more common, hardware optimized for tasks like matrix multiplication and mature well-maintained libraries with automatic differentiation capabilities. Plus the emergence of things that mix both classical ML and transformers, like training networks to approximate intermolecular potentials faster than the ab-initio calculation, allowing for accelerating molecular dynamics simulations.
Transformers aren't only used in language processing. They're very useful in image processing, video, audio, etc. They're kind of like a general-purpose replacement for RNNs that are better in many ways.
The goal was never to answer the question. So what if it's worse. It's not worse for the researchers. It's not worse for the CEOs and the people who work for the AI companies. They're bathing in the limelight so their actual goal, as they would state it to themselves, is: "To get my bit of the limelight"
>The final conversation on Sewell’s screen was with a chatbot in the persona of Daenerys Targaryen, the beautiful princess and Mother of Dragons from “Game of Thrones.”
>
>“I promise I will come home to you,” Sewell wrote. “I love you so much, Dany.”
>
>“I love you, too,” the chatbot replied. “Please come home to me as soon as possible, my love.”
>
>“What if I told you I could come home right now?” he asked.
>
>“Please do, my sweet king.”
>
>Then he pulled the trigger.
Reading the newspaper is such a lovely experience these days. But hey, the AI researchers are really excited so who really cares if stuff like this happens if we can declare that "therapy is transformed!"
It sure is. Could it have been that attention was all that kid needed?
https://x.com/aelluswamy/status/1981760576591393203
saving lives
I'm not watching a video on Twitter about self driving from the company who told us twelve years ago that completely autonomous vehicles were a year away as a rebuttal to the point I made.
If you have something relevant to say, you can summarize for the class & include links to your receipts.
So, unless this went r/woosh over my head....how is current AI better than shit post-transformers? If all....old shit post-transformers are at least deterministic or open and not a randomized shitbox.
Unless I misinterpreted the post, render me confused.
I wasn't too clear, I think. Apologies if the wording was confusing.
People who started their NLP work (PhDs etc; industry research projects) before the LLM / transformer craze had to adapt to the new world. (Hence 'post-mass-uptake-of-transformers')
I think you're misinterpreting: "with the advent of transformers, (many) people doing NLP with pre-transformers techniques had to salvage their shit"
There's no post-transformer tech. There are lots of NLP tasks that you can now, just, prompt an LLM to do.
Yeah unclear wording; see the sibling comment also. I meant "the tech we have now", in the era after "attention is all you need"
In the super-public consumer space, search engines / answer engines (like ChatGPT) are the big ones.
On the other hand, it's also led to improvements in many places hidden behind the scenes. For example, vision transformers are much more powerful and scalable than many of the other computer vision models, which has probably led to new capabilities.
In general, transformers aren't just "generate text"; they're a new foundational model architecture which enables a leap in many things that require modeling!
Transformers also make for a damn good base to graft just about any other architecture onto.
Like, vision transformers? They seem to work best when they still have a CNN backbone, but the "transformer" component is very good at focusing on relevant information, and doing different things depending on what you want to be done with those images.
And if you bolt that hybrid vision transformer to an even larger language-oriented transformer? That also imbues it with basic problem-solving, world knowledge and commonsense reasoning capabilities - which, in things like advanced OCR systems, are very welcome.
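A hybrid like that can be sketched in a few lines of PyTorch; this is just the general shape of the idea (layer sizes, depths, and the pooling choice are arbitrary here, not any specific production model): a small CNN backbone turns the image into a grid of feature tokens, and a transformer encoder then attends over those tokens.

    import torch
    import torch.nn as nn

    class HybridViT(nn.Module):
        # Toy CNN-backbone + transformer-encoder image classifier.
        def __init__(self, num_classes=10, dim=128):
            super().__init__()
            # CNN backbone: turns a 64x64 image into a 16x16 grid of feature vectors.
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.pos = nn.Parameter(torch.zeros(1, 16 * 16, dim))  # learned positions
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(dim, num_classes)

        def forward(self, images):                     # images: (B, 3, 64, 64)
            feats = self.backbone(images)              # (B, dim, 16, 16)
            tokens = feats.flatten(2).transpose(1, 2)  # (B, 256, dim) token grid
            tokens = self.encoder(tokens + self.pos)   # attention over image regions
            return self.head(tokens.mean(dim=1))       # pool tokens, classify

    logits = HybridViT()(torch.randn(2, 3, 64, 64))    # -> shape (2, 10)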
Genomics, protein structure prediction, various forms of small molecule and large molecule drug discovery.
In computer vision, transformers have basically taken over most perception subfields. If you look at paperswithcode benchmarks, it's common to find 10/10 recent winners being transformer-based on common CV problems. Note, I'm not talking about VLMs here, just small ViTs with a few million parameters. YOLOs and other CNNs are still hanging around for detection, but it's only a matter of time.
hah well, transformative doesn't necessarily mean positive!
All we get is distraction.
Out of curiosity, what field are you in?
Spam detection and phishing detection are completely different than 5 years ago, as one cannot rely on typos and grammar mistakes to identify bad content.
Spam, scams, propaganda, and astroturfing are easily the largest beneficiaries of LLM automation, so far. LLMs are exactly the 100x rocket-boots their boosters are promising for other areas (without such results outside a few tiny, but sometimes important, niches, so far) when what you're doing is producing throw-away content at enormous scale and have a high tolerance for mistakes, as long as the volume is high.
It seems unfair to call out LLMs for "spam, scams, propaganda, and astroturfing." These problems are largely the result of platform optimization for engagement and SEO competition for attention. This isn't unique to models; even we, humans, when operating without feedback, generate mostly slop. Curation is performed by the environment and the passage of time, which reveals consequences. LLMs taken in isolation from their environment are just as sloppy as brains in a similar situation.
Therefore, the correct attitude to take regarding LLMs is to create ways for them to receive useful feedback on their outputs. When using a coding agent, have the agent work against tests. Scaffold constraints and feedback around it. AlphaZero, for example, had abundant environmental feedback and achieved amazing (superhuman) results. Other Alpha models (for math, coding, etc.) that operated within validation loops reached Olympiad level in specific types of problem-solving. The limitation of LLMs is actually a limitation of their incomplete coupling with the external world.
In fact you don't even need a super intelligent agent to make progress, it is sufficient to have copying and competition, evolution shows it can create all life, including us and our culture and technology without a very smart learning algorithm. Instead what it has is plenty of feedback. Intelligence is not in the brain or the LLM, it is in the ecosystem, the society of agents, and the world. Intelligence is the result of having to pay the cost of our execution to continue to exist, a strategy to balance the cost of life.
What I mean by feedback is exploration, when you execute novel actions or actions in novel environment configurations, and observe the outcomes. And adjust, and iterate. So the feedback becomes part of the model, and the model part of the action-feedback process. They co-create each other.
> It seems unfair to call out LLMs for "spam, scams, propaganda, and astroturfing." These problems are largely the result of platform optimization for engagement and SEO competition for attention.
They didn't create those markets, but they're the markets for which LLMs enhance productivity and capability the best right now, because they're the ones that need the least supervision of input to and output from the LLMs, and they happen to be otherwise well-suited to the kind of work it is, besides.
> This isn't unique to models; even we, humans, when operating without feedback, generate mostly slop.
I don't understand the relevance of this.
> Curation is performed by the environment and the passage of time, which reveals consequences.
I'd say it's revealed by human judgement and eroded by chance, but either way, I still don't get the relevance.
> LLMs taken in isolation from their environment are just as sloppy as brains in a similar situation.
Sure? And clouds are often fluffy. Water is often wet. Relevance?
The rest of this is a description of how we can make LLMs work better, which amounts to more work than required to make LLMs pay off enormously for the purposes I called out, so... are we even in disagreement? I don't disagree that perhaps this will change, and explicitly bound my original claim ("so far") for that reason.
... are you actually demonstrating my point, on purpose, by responding with LLM slop?
LLMs can generate slop if used without good feedback or trying to minimize human contribution. But the same LLMs can filter out the dark patterns. They can use search and compare against dozens or hundreds of web pages, which is like the deep research mode outputs. These reports can still contain mistakes, but we can iterate - generate multiple deep reports from different models with different web search tools, and then do comparative analysis once more. There is no reason we should consume raw web full of "spam, scams, propaganda, and astroturfing" today.
For a good while I joked that I could easily write a bot that makes more interesting conversation than you. The human slop will drown in AI slop. Looks like we will need to make more of an effort when publishing, if not develop our own personality.
> It seems unfair to call out LLMs for "spam, scams, propaganda, and astroturfing."
You should hear HN talk about crypto. If the knife were invented today they'd have a field day calling it the most evil plaything of bandits, etc. Nothing about human nature, of course.
Edit: There it is! Like clockwork.
The signals might be different, but the underlying mechanism is still incredibly efficient, no?
AI fan (type 1 -- AI made a big breakthrough) meets AI defender (type 2 -- AI has not fundamentally changed anything that was already a problem).
Defenders are supposed to defend against attacks on AI, but here it misfired, so the conversation should be interesting.
That's because the defender is actually a skeptic of AI. But the first sentence sounded like a typical "nothing to see here" defense of AI.
> but I would say the impact is almost entirely negative.
quite
the transformer innovation was to bring down the cost of producing incorrect, but plausible looking content (slop) in any modality to near zero
not a positive thing for anyone other than spammers
Software, and it’s wildly positive.
Takes like this are utterly insane to me
Wouldn't say it's transformative.
My workflow is transformed. If yours isn’t you’re missing out.
Days that I’d normally feel overwhelmed from requests by management are just Claude Code and chill days now.
It’s had an impact on software for sure. Now I have to fix my coworker’s AI slop code all the time. I guess it could be a positive for my job security. But acting like “AI” has had a wildly positive impact on software seems, at best, a simplification and, at worst, the opposite of reality.
Which fields have they completely transformed?
Simultaneously discovering and leveraging the functional nature of language seems like kind of a big deal.
Can you explain what this means?
Given that we can train a transformer model by shoveling large amounts of inert text at it, and then use it to compose original works and solve original problems with the addition of nothing more than generic computing power, we can conclude that there's nothing special about what the human brain does.
All that remains is to come up with a way to integrate short-term experience into long-term memory, and we can call the job of emulating our brains done, at least in principle. Everything after that will amount to detail work.
> we can conclude that there's nothing special about what the human brain does
...lol. Yikes.
I do not accept your premise. At all.
> use it to compose original works and solve original problems
Which original works and original problems have LLMs solved, exactly? You might find a random article or stealth marketing paper that claims to have solved some novel problem, but if what you're saying were actually true, we'd be flooded with original works and new problems being solved. So where are all these original works?
> All that remains is to come up with a way to integrate short-term experience into long-term memory, and we can call the job of emulating our brains done, at least in principle
What experience do you have that caused you to believe these things?
Which is fine, but it's now clear where the burden of proof lies, and IMHO we have transformer-based language models to thank for that.
If anyone still insists on hidden magical components ranging from immortal souls to Penrose's quantum woo, well... let's see what you've got.
No, the burden of proof is on you to deliver. You are the claimant, you provide the proof. You made a drive-by assertion with no evidence or even arguments.
I also do not accept your assertion, at all. Humans largely function on the basis of desire-fulfilment, be that eating, fucking, seeking safety, gaining power, or any of the other myriad human activities. Our brains, and the brains of all the animals before us, have evolved for that purpose. For evidence, start with Skinner or the millions of behavioral analysis studies done in that field.
Our thoughts lend themselves to those activities. They arise from desire. Transformers have nothing to do with human cognition because they do not contain the basic chemical building blocks that precede and give rise to human cognition. They are, in fact, stochastic parrots that can fool others, like yourself, into believing they are somehow thinking.
[1] Libet, B., Gleason, C. A., Wright, E. W., & Pearl, D. K. (1983). Time of conscious intention to act in relation to onset of cerebral activity (readiness-potential). Brain, 106(3), 623-642.
[2] Soon, C. S., Brass, M., Heinze, H. J., & Haynes, J. D. (2008). Unconscious determinants of free decisions in the human brain. Nature Neuroscience, 11(5), 543-545.
[3] Berridge, K. C., & Robinson, T. E. (2003). Parsing reward. Trends in Neurosciences, 26(9), 507-513. (This paper reviews the "wanting" vs. "liking" distinction, where unconscious "wanting" or desire is driven by dopamine).
[4] Kavanagh, D. J., Andrade, J., & May, J. (2005). Elaborated Intrusion theory of desire: a multi-component cognitive model of craving. British Journal of Health Psychology, 10(4), 515-532. (This model proposes that desires begin as unconscious "intrusions" that precede conscious thought and elaboration).
I had edited my comment, I think you replied before I saved it.
I was just saying that it's fine if you don't accept my premise, but that doesn't change the reality of the premise.
Performance at the International Math Olympiad qualifies as solving original problems, for example. If you disagree, that's a case you have to make. Transformer models are unquestionably better at math than I am. They are also better at composition, and will soon be better at programming if they aren't already.
Every time a magazine editor is fooled by AI slop, every time an entire subreddit loses the Turing test to somebody's ethically-questionable 'experiment', every time an AI-rendered image wins a contest meant for human artists -- those are original works.
Heck, looking at my Spotify playlist, I'd be amazed if I haven't already been fooled by AI-composed music. If it hasn't happened yet, it will probably happen next week, or maybe next year. Certainly within the next five years.
Humans hallucinate too, but there's usually some dysfunction behind it, and it's not expected as a normal operational output.
>If anyone still insists on hidden magical components ranging from immortal souls to Penrose's quantum woo, well... let's see what you've got.
This isn't too far off from the marketing and hypesteria surrounding "AI" companies.
> I think the valuable idea is probabilistic graphical models- of which transformers is an example- combining probability with sequences, or with trees and graphs- is likely to continue to be a valuable area for research exploration for the foreseeable future.
As somebody who was a biiiiig user of probabilistic graphical models, and felt kind of left behind in this brave new world of stacked nets, I would love for my prior knowledge and experience to become valuable for a broader set of problem domains. However, I don't see it yet. Hope you are right!
+1. I am also a big user of PGMs, and also a big user of transformers, and I don't know what the parent comment is talking about, beyond that for e.g. LLMs, sampling the next token can be thought of as sampling from a conditional distribution (of the next token, given the previous tokens); see the sketch below. However, this connection between transformers and sampling from conditional distributions is about autoregressive generation and training with a next-token prediction loss, not about the transformer architecture itself, which mostly seems to be good because it is expressive and scalable (i.e. can be hardware-optimized).
Source: I am a PhD student, this is kinda my wheelhouse
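To make that connection concrete, here is a minimal sketch (my own toy illustration, not any model's actual decoding loop); `sample_next_token` and the fake logits are assumptions for the example, and the transformer's only role would be producing those logits:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(next_token_logits):
    """Draw x_t ~ p(x_t | x_<t); the transformer's only job is the logits."""
    logits = next_token_logits - next_token_logits.max()   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()           # conditional distribution
    return rng.choice(len(probs), p=probs)                  # sample the next token

# toy vocabulary of 5 tokens
print(sample_next_token(np.array([1.0, 2.5, 0.3, -1.0, 0.0])))
```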
> I think the valuable idea is probabilistic graphical models- of which transformers is an example- combining probability with sequences, or with trees and graphs- is likely to continue to be a valuable area
I agree. Causal inference and symbolic reasoning would be super juicy nuts to crack, more so than what we got from transformers.
Not doubting in any way, but what are some fields it transformed?
I have my own probabilistic hyper-graph model which I have never written down in an article to share. You see people converging on this idea all over if you’re looking for it.
Wish there were more hours in the day.
Yeah I think this is definitely the future. Recently, I too have spent considerable time on probabilistic hyper-graph models in certain domains of science. Maybe it _is_ the next big thing.
> probabilistic graphical models- of which transformers is an example
Having done my PhD in probabilistic programming... what?
It's difficult to do because of how well matched they are to the hardware we have. They were partially designed to solve the mismatch between RNNs and GPUs, and they are way too good at it. If you come up with something truly new, it's quite likely you have to influence hardware makers to help scale your idea. That makes any new idea fundamentally coupled to hardware, and that's the lesson we should be taking from this. Work on the idea as a simultaneous synthesis of hardware and software. But, it also means that fundamental change is measured in decade scales.
I get the impulse to do something new, to be radically different and stand out, especially when everyone is obsessing over it, but we are going to be stuck with transformers for a while.
This is backwards. Algorithms that can be parallelized are inherently superior, independent of the hardware. GPUs were built to take advantage of that superiority and handle all kinds of parallel algorithms well: graphics, scientific simulation, signal processing, some financial calculations, and on and on.
There’s a reason so much engineering effort has gone into speculative execution, pipelining, multicore design etc - parallelism is universally good. Even when “computers” were human calculators, work was divided into independent chunks that could be done simultaneously. The efficiency comes from the math itself, not from the hardware it happens to run on.
RNNs are not parallelizable by nature. Each step depends on the output of the previous one. Transformers removed that sequential bottleneck.
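Here is a toy sketch of that contrast (my own illustration, with made-up shapes and the causal mask omitted): the recurrence has to run step by step, while attention mixes every position with every other in a couple of big matrix products.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 128, 64                                    # toy sequence length and width
x = rng.standard_normal((T, d))

# RNN-style recurrence: step t needs h from step t-1, so the loop is sequential.
W_h, W_x = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h = np.zeros(d)
for t in range(T):
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Attention-style mixing: all positions interact through two large matrix
# products, which is exactly the kind of work GPUs are built to do fast.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
out = weights @ v                                 # (T, d), computed in parallel
```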
Haha, I like to joke that we were on track for the singularity in 2024, but it stalled because the research-time gap between "profitable" and "recursive self-improvement" was just a bit too long, so now we're stranded on the transformer model for the next two decades until every last cent has been extracted from it.
There's a massive hardware and energy infra build-out going on. None of that is specialized to run only transformers at this point, so wouldn't that create a huge incentive to find newer and better architectures to get the most out of all this hardware and energy infra?
>None of that is specialized to run only transformers at this point
isn't this what [etched](https://www.etched.com/) is doing?
Only being able to run transformers is a silly concept, because attention consists of two matrix multiplications, which are the standard operation in feed forward and convolutional layers. Basically, you get transformers for free.
devil is in the details
how do you know we're not at recursive self-improvement but the rate is just slower than human-mediated improvement?
> Now, as CTO and co-founder of Tokyo-based Sakana AI, Jones is explicitly abandoning his own creation. "I personally made a decision in the beginning of this year that I'm going to drastically reduce the amount of time that I spend on transformers," he said. "I'm explicitly now exploring and looking for the next big thing."
So, this is really just BS hype talk, trying to attract more funding and VCs.
Attention is all he needs.
Reminds me of the headline I saw a long time ago: “50 years later, inventor of the pixel says he’s sorry that he made it square.”
Sadly, he probably needs a lot more or he's gonna go all Maslow...
Why wouldn't this both be an attempt to get funding and also him wanting to do something new? Certainly if he was wanting to do something new he'd want it funded, too?
It's also how curious scientists operate, they're always itching for something creative and different.
Well he got your attention didn't he?
anyone know what they're trying to sell here?
The ability to do original, academic research without the pressure to build something marketable.
probably AI
It would be hype talk if he said and my next big thing is X.
Well, that's why he needs funding. Hasn't figured out what the next big thing is.
He sounds a lot like how some people behave when they reach a "top": suddenly that thing seems unworthy. It's one of the reasons you'll see your favorite music artist go in a totally different direction on their next album. It's almost an artistic process. There's a core arrogance involved: that you were responsible for the outcome and can easily create another great outcome.
Many researchers who invent something new and powerful pivot quickly to something else. That's because they're researchers, and the incentive is to develop new things that subsume the old things. Other researchers will continue to work on improving existing things and finding new applications to existing problems, but they rarely get as much attention as the folks who "discover" something new.
Also, not all researchers have the fortune of doing the research they would want to. If he can do it, it would be foolish not to take the opportunity.
Why "arrogance"? There are music artists that truly enjoy making music and don't just see their purpose in maximizing financial success and fan service?
There are other considerations that don't revolve around money, but I feel it's arrogant to assume success is the only motivation for musicians.
Sans money, it's arrogant because we know talent is god-given. You are basically betting again that your naturally given trajectory has more legroom for more incredible output. It's not a bad bet at all, but it is a bet. Some talent is so incredible that it takes a while for the ego to accept its limits. Jordan tried to come back at 40 and Einstein fought quantum mechanics unto death. Accepting the limits has nothing to do with mediocrity, and everything to do with humility. You can still have an incredible trajectory beyond belief (which I believe this person has and will have).
Einstein also got his Nobel Prize for basically discovering quanta. I'm not sure he fought it so much as tried to figure out what's going on with it, which is still kind of unknown.
That's just normal human behaviour, to have evolving interests.
Arrogance would be if he explicitly chose to abandon it because he thought he was better.
Or a core fear, that you'll never do something as good in the same vein as the smash hit you already made, so you strike off in a completely different direction.
Sometimes it just turns out like Michael Jordan playing baseball.
When you're overpressured to succeed, it makes a lot of sense to switch up your creative process in hopes of getting something new or better.
It doesn't mean that you'll get good results by abandoning prior art, either with LLMs or musicians. But it does signal a sort of personal stress and insecurity, for sure.
It's a good process (although, many take it to its common conclusion which is self-destruction). It's why the most creative people are able to re-invent themselves. But one must go into everything with both eyes open, and truly humble themselves with the possibility that that may have been the greatest achievement of their life, never to be matched again.
I wonder if he can simply sit back and bask in the glory of being one of the most important people during the infancy of AI. Someone needs to interview this guy, would love to see how he thinks.
If it was about money it would probably be easier to double down on something proven to make revenue rather than something that doesn't even exist.
Edit: there is a cult around transformers.
His AI company is called "Fish AI"?? Does it mean their AI will have the intelligence of a fish?
It's about collective intelligence, as seen in swarms of ants or fish.
Or Fishy?
Without transformers, maybe.
/s
Hope we're not talking about eels.
What "AI" means for most people is the software product they see, but only a part of it is the underlying machine learning model. Each foundation model receives additional training from thousands of humans, often very lowly paid, and then many prompts are used to fine-tune it all. It's 90% product development, not ML research.
If you look at AI research papers, most of them are by people trying to earn a PhD so they can get a high-paying job. They demonstrate an ability to understand the current generation of AI and tweak it; they create content for their CVs.
There is actual research going on, but it's a tiny share of everything, and it does not look impressive because it's not a product or a demo but an experiment.
I have a feeling there is more research being done on non-transformer based architectures now, not less. The tsunami of money pouring in to make the next chatbot powered CRM doesn’t care about that though, so it might seem to be less.
I would also just fundamentally disagree with the assertion that a new architecture will be the solution. We need better methods to extract more value from the data that already exists. Ilya Sutskever talked about this recently. You shouldn’t need the whole internet to get to a decent baseline. And that new method may or may not use a transformer, I don’t think that is the problem.
The assertion, or maybe idea, that a new architecture may be the thing is kind of about building AGI rather than chatbots.
Like, humans think about things and learn, which may require something different from feeding the internet into a transformer to pre-train it.
I think you misunderstood the article a bit by saying that the assertion is "that a new architecture will be the solution". That's not the assertion. It's simply a statement about the lack of balance between exploration and exploitation. And the desire to rebalance it. What's wrong with that?
It looks like almost every AI researcher and lab who existed pre-2017 is now focused on transformers somehow. I agree the total number of researchers has increased, but I suspect the ratio has moved faster, so there are now fewer total non-transformer researchers.
Well, we also still use wheels despite them being invented thousands of years ago. We have added tons of improvements on top though, just as transformers have. The fact that wheels perform poorly in mud doesn’t mean you throw out the concept of wheels. You add treads to grip the ground better.
If you check the DeepSeek-OCR paper, it shows that text-based tokenization may be suboptimal. Also, all of the MoE stuff, reasoning, and RLHF. The 2017 paper is pretty primitive compared to what we have now.
Transformers have sucked up all the attention and money. And AI scientists have been sucked in to the transformer-is-prime industry.
We will spend more time in the space until we see bigger roadblocks.
I really wish energy consumption were a big enough roadblock to force them to keep doing research.
I think it may become a roadblock quite soon. If you look at all the data centers planned, and the speed of it, it's going to be a job getting the energy. xAI hacked it by putting about 20 gas turbines around their data center, which is giving locals health problems from the pollution. I imagine that sort of thing will be cracked down on.
If there's a legit long-term demand for energy, the market will figure it out. I doubt that will be a long-term issue; it's just a short-term one because of the gold rush. But innovation doesn't have to happen overnight. The world doesn't live or die on a subset of VC funds not 100xing within a certain timeframe.
Or it’s possible China just builds the power capabilities faster because they actually build new things
I think people care too much about trying to innovate a new model architecture. Models are meant to create a compressed representation of their training data. Even if you came up with a more efficient compression, the capabilities of the model wouldn't be any better. What is more relevant is finding more efficient ways of training, like the shift to reinforcement learning these days.
But isn't the maximum training efficiency naturally tied to the architecture? Meaning other architectures have different training-efficiency landscapes? I've said it somewhere else: it is not about "caring too much about new model architectures" but about having a balance between exploitation and exploration.
I think a transformer wrote this article, seeing a suspicious number of em dashes in the last section
The next big AI architectural fad will be "disrupters".
Maybe even 'terminators'
I ask myself how much this industry's focus on transformer models is informed by the ease of computation on GPUs/NPUs, and whether better AI technology is possible but would require much greater computing power on traditional hardware architectures. We depend so much on traditional computation architectures that it might be a real blinder. My brain doesn't need 500 watts, at least I hope so.
>The project, he said, was "very organic, bottom up," born from "talking over lunch or scrawling randomly on the whiteboard in the office."
Many of the breakthrough, game-changing inventions were done this way, with back-of-the-envelope discussions; another popular example is the Ethernet network.
Some good stories of a similar culture at AT&T's Bell Labs are well described in Hamming's book [1].
[1] The Art of Doing Science and Engineering, Stripe Press: https://press.stripe.com/the-art-of-doing-science-and-engine...
All transformative inventions and innovations seem to come from similar scenarios, like "I was playing around with these things" or "I just met X at lunch and we discussed ...".
I'm wondering how big an impact work from home will really have on humanity in general, when so many of our life-changing discoveries come from the odd chance of two specific people happening to be in the same place at some moment in time.
What you say is true, but let's not forget that Ken Thompson did the first version of Unix in 3 weeks while his wife had gone to California with their child to visit relatives, so deep focus is important too.
It seems, in those days, people at Bell Labs did get the best of both worlds: being able to have chance encounters with very smart people while also being able to just be gone for weeks to work undistracted.
A dream job that probably didn’t even feel like a job (at least that’s the impression I get from hearing Thompson talk about that time).
Perhaps this is why we see AI devotees congregate in places like SF: increased probability of those chance encounters.
I'd go back to the office in a heartbeat, provided it was an actual office. And not an "open-office" layout, where people are forced to try to concentrate with all the noise and people passing behind them constantly.
The agile treadmill (with PMs breathing down our necks) and features getting planned and delivered in two-week sprints have also reduced our ability to just do something we feel needs getting done. Today you go to work to feed several layers of incompetent managers; there is no room for play, or for creativity. At least in most orgs I know.
I think innovation (or even joy of being at work) needs more than just the office, or people, or a canteen, but an environment that supports it.
Personally, I try to under-promise on what I think I can do every sprint specifically so I can spend more time mentoring more junior engineers, brainstorming random ideas, and working on stuff that nobody has called out as something that needs working on yet.
Basically, I set aside as much time as I can to squeeze in creativity and real engineering work into the job. Otherwise I'd go crazy from the grind of just cranking out deliverables
yeah that sounds like a good strategy to avoid burn-out.
We have an open office surrounded by "breakout offices". I simply squat in one of the offices (I take most meetings over video chat), as do most of the other principals. I don't think I could do my job in an office if I couldn't have a room to work in most of the time.
As for agile: I've made it clear to my PMs that I generally plan on a quarterly/half year basis and my work and other people's work adheres to that schedule, not weekly sprints (we stay up to date in a slack channel, no standups)
And it has always felt to me that it has lineage from the neural Turing machine line of work as a prior. The transformative part was: 1. find a good task (machine translation) and a reasonable way to stack (encoder-decoder architecture); 2. run the experiment; 3. ditch the external KV store idea and just use self-projected KV.
Related thread: https://threadreaderapp.com/thread/1864023344435380613.html
True in creativity too.
According to various stories pieced together, the ideas of 4 of Pixar’s early hits were conceived on or around one lunch.
A Bug's Life, WALL-E, Monsters, Inc.
The fourth one is Finding Nemo
One of the OG Unix guys (was it Kernighan?) literally specced out UTF-8 on a cocktail napkin.
Thompson and Pike: https://en.wikipedia.org/wiki/UTF-8
"""Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout,[11] and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.[9]"""
Something I haven't been able to fully parse, and that perhaps someone has better insight into: aren't transformers inherently only capable of inductive reasoning? In order to actually progress to AGI, which is being promised at least as an eventuality, don't models have to be capable of deduction? Wouldn't that mean fundamentally changing the pipeline in some way? And no, tools are not deduction. They are useful patches for the lack of deduction.
Models need to move beyond the domain of parsing existing information into existing ideas.
I don't see any reason to think that transformers are not capable of deductive reasoning. Stochasticity doesn't rule out that ability. It just means the model might be wrong in its deduction, just like humans are sometimes wrong.
That sounds like a category mistake to me. A proof assistant or logic-programming system performs deduction, and just strapping one of those to an LLM hasn't gotten us to "AGI".
A proof assistant is a verifier, and a tool, and therefore a patch, so I really fail to see how that could be understood as the LLM having deduction.
They can induct; they just can't generate new ideas. It's not going to discover a new quark without a human in the loop somewhere.
maybe that's a good thing after all.
If anyone has a video of it, I think we'd all very much appreciate you posting a link. I've tried and I can't find one.
The other big missing part here is the enormous incentive (and punishment if you don't) to publish in the big three AI conferences. And because quantity is being rewarded far more than quality, the meta is to do really shoddy and uninspired work really quickly. The people I talk to have a 3-month time horizon on their projects.
My opinion on the "Attention is all you need" paper is that its most important idea is the Positional Encoding. The transformer head itself... is just another NN block among many.
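For anyone who hasn't looked at it in a while, here is a minimal sketch of the sinusoidal positional encoding from that paper (my own reimplementation in NumPy; the function name and the assumption that `d` is even are mine):

```python
import numpy as np

def positional_encoding(T, d):
    """Sinusoidal positional encoding from the 2017 paper (assumes d is even)."""
    pos = np.arange(T)[:, None]                  # (T, 1) positions
    i = np.arange(d // 2)[None, :]               # (1, d/2) frequency indices
    angles = pos / np.power(10000.0, 2 * i / d)  # (T, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe                                    # added to the token embeddings

print(positional_encoding(4, 8).shape)           # (4, 8)
```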
I'm tired of feeling like the articles I read are AI generated.
Isn't Sakana the one that got flak for falsely advertising its CUDA codegen abilities?
Of course he's sick. He could have made billions.
Money has diminishing returns. Not everyone wants to buy Twitter.
When you have your (next) lightbulb moment, how would you monetize such an idea? Royalties? 1c after each request?
Leave and raise a round right away.
But attention is all he needs.
tl;dr: AI is built on top of science done by people just "doing research", and transformers took off so hard that those same people now can't do any meaningful, real AI research anymore because everyone only wants to pay for "how to make this one single thing that everyone else is also doing, better" instead of being willing to fund research into literally anything else.
It's like if someone invented the hamburger and every single food outlet decided to only serve hamburgers from that point on, only spending time and money on making the perfect hamburger, rather than spending time and effort on making great meals. Which sounds ludicrously far-fetched, but is exactly what happened here.
Good points, and it made me have a mini epiphany...
I think you analogously just described Sun Microsystems, where Unixes (BSD originally in their case, generalized to a SVR4 (?) hybrid later) worked soooo well that NT was built as a hybridization for the Microsoft user base, Apple reabsorbed the BSD-Mach-DisplayPostscript hybridization spinoff NeXT, and Linux simultaneously thrived.
Dude now I want a hamburger :(
These are evolutionary dead ends. Sorry that I'm not inspired enough to see it any other way; this transformer-based direction is good enough.
The LLM stack has enough branches of evolution within it for efficiency. Agent-based work can power a new industrial revolution specifically around white-collar workers on its own, while expanding self-expression for personal fulfillment for everyone else.
Well have fun sir
^AI psychosis, never underestimate its effects.
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
And here I thought this would be about Transformers: Robots in Disguise. The form of transformers I'm tired of hearing about.
And the decepticons.