> We used Trillium TPUs to train the new Gemini 2.0,
Wow. I knew custom Google silicon was used for inference, but I didn't realize it was used for training too. Does this mean Google is free of dependence on Nvidia GPUs? That would be a huge advantage over AI competitors.
Google's TPUs have been used for training for at least 5 years, probably more (I think it's closer to 10). They don't depend on Nvidia GPUs for the majority of their projects. It took TPUs a while to catch up on some details, like sparsity.
This aligns with my knowledge. I don't know much about LLMs, but TPUs have been used for training deep prediction models in ads since at least 2018, though there were some gaps filled by CPUs/GPUs for a while. Nowadays, TPU capacity is probably greater than CPU and GPU capacity combined.
+1, almost all (if not all) Google training runs on TPU. They don't use NVIDIA GPUs at all.
At some point some researchers were begging for GPUs, mainly for sparse work. I think that's why SparseCore was added to TPUs (https://cloud.google.com/tpu/docs/system-architecture-tpu-vm...) in v4. I think at this point, given their turnaround time, they can catch up as competitors add new features and researchers want to use them.
Dumb question: what do you mean by sparse work? Is it embedding lookups?
(TPUs have had BarnaCore for efficient embedding lookups since TPU v3)
Mostly embedding, but IIRC DeepMind RL made use of sparsity- basically, huge matrices with only a few non-zero elements.
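For anyone else wondering what "sparse work" looks like in practice, here's a minimal JAX sketch (sizes and names purely illustrative): an embedding lookup is mathematically a multiply by an almost-all-zeros one-hot matrix, but it's implemented as a gather over a few rows of a big table. That access pattern is roughly what embedding-oriented hardware like SparseCore is built to speed up.

```python
# Minimal sketch of "sparse work" as an embedding lookup (illustrative sizes).
# Each example touches only a handful of rows of a huge table, i.e. a gather,
# rather than a dense matmul against a mostly-zero one-hot matrix.
import jax
import jax.numpy as jnp

vocab_size, embed_dim = 100_000, 64
key = jax.random.PRNGKey(0)
table = jax.random.normal(key, (vocab_size, embed_dim))  # the embedding table

ids = jnp.array([3, 17, 99_999])   # the few "non-zero" positions for this batch
embeddings = table[ids]            # gather: shape (3, 64)
print(embeddings.shape)
```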
BarnaCore existed and was used, but was tailored mostly for embeddings. BTW, IIRC they were called that because they were added "like a barnacle hanging off the side".
The evolution of the TPU has been interesting to watch; I came from the HPC and supercomputing space, and seeing Google stay mostly-CPU for the longest time, then finally learn how to build "supercomputers" over a decade-plus (gradually adding many features that classical supercomputers have long had), was a very interesting process. Some very expensive mistakes along the way. But now they've paid down almost all the expensive up-front costs and can ride on the margins, adding new bits and pieces while increasing the clocks and capacities on a cadence.
Do they have the equivalent of CUDA, and what is it called?
My understanding is that the Trillium TPU was primarily targeted at inference (so it’s surprising to see it was used to train Gemini 2.0) but other generations of TPUs have targeted training. For example the chip prior to this one is called TPU v5p and was targeted toward training.
Why is that a huge advantage over AI competitors? Just not having to fight for limited Nvidia supply?
That is one factor, but another is total cost of ownership. At large scale, something that's 1/2 the overall speed but 1/3 the total cost is still a net win by a large margin. This is one of the reasons why every major hyperscaler is, to some extent, developing their own hardware, e.g. Meta, who famously have an insane number of Nvidia GPUs.
Of course this does not mean their models will necessarily be proportionally better, nor does it mean Google won't buy GPUs for other reasons (like providing them to customers on Google Cloud).
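Spelling out the parent's arithmetic with made-up numbers, since the ratio is the whole argument:

```python
# Purely illustrative: half the speed at a third of the cost still wins on
# throughput per dollar.
relative_speed = 0.5    # vs. the faster chip
relative_cost = 1 / 3   # vs. the faster chip
print(relative_speed / relative_cost)  # 1.5 -> ~50% more work per dollar
```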
TPUs are cheaper and faster than GPUs. But it's custom silicon. Which means the barrier to entry is very, very high.
> Which means the barrier to entry is very, very high.
+1 on this. The tooling to use TPUs still needs more work. But we are betting on building this tooling and unlocking these ASIC chips (https://github.com/felafax/felafax).
Vertical integration.
Nvidia is making big bucks "selling shovels in a gold rush". Google has made their own shovel factory and they can avoid paying Nvidia's margins.
Yes. And cheaper operating cost per TFLOP.
Since TPUv2, announced in 2017: https://arstechnica.com/information-technology/2017/05/googl...
The hyperscalers are all working on this. https://aws.amazon.com/ai/machine-learning/trainium/
TPUs have been used for training for a long time.
(PS: we're a startup trying to make TPUs more accessible; if you want to fine-tune Llama 3 on TPUs, check out https://github.com/felafax/felafax)
Maybe only for their own models
Now any Google customer can use Trillium for training any model?
[Google employee] Yes, you can use TPUs in Compute Engine and GKE, among other places, for whatever you'd like. I just checked and the v6 are available.
Is there not going to be a v6p?
Can't speculate on futures, but here's the current version log ... https://cloud.google.com/tpu/docs/system-architecture-tpu-vm...
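If you do spin one up, a quick sanity check (assuming JAX with the TPU runtime, i.e. the jax[tpu] install, is present on the TPU VM):

```python
# List the accelerators JAX can see and run a trivial op on the default backend.
import jax
import jax.numpy as jnp

print(jax.devices())       # TpuDevice entries on a TPU VM, CPU devices otherwise
print(jax.device_count())  # number of cores/chips visible to this host

x = jnp.ones((1024, 1024))
print((x @ x).sum())       # executes on the TPU backend when one is present
```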
Google trained Llama-2-70B on Trillium chips
I thought Llama was trained by Meta.
> Google trained Llama
Source? This would make quite the splash in the market
It's in the article: "When training the Llama-2-70B model, our tests demonstrate that Trillium achieves near-linear scaling from a 4-slice Trillium-256 chip pod to a 36-slice Trillium-256 chip pod at a 99% scaling efficiency."
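For context on what that 99% figure means, here is the usual way scaling efficiency is computed; the throughput numbers below are made up, and Google's exact methodology may differ:

```python
# Rough sketch of a scaling-efficiency calculation (hypothetical throughputs).
def scaling_efficiency(tput_small, slices_small, tput_large, slices_large):
    ideal_speedup = slices_large / slices_small
    actual_speedup = tput_large / tput_small
    return actual_speedup / ideal_speedup

# e.g. a 4-slice pod at 1.0 (normalized tokens/sec) vs a 36-slice pod at 8.91:
print(scaling_efficiency(1.0, 4, 8.91, 36))  # 0.99 -> "99% scaling efficiency"
```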
I'm pretty sure they're doing fine-tune training, using Llama because it is a widely known and available sample. They used SDXL elsewhere for the same reason.
Llama 2 was released well over a year ago and was trained between Meta and Microsoft.
They can just train another one.
Llama 2 end weights are public. The data used to train it, or even the process used to train it, are not. Google can't just train another Llama 2 from scratch.
They could train something similar, but it'd be super weird if they called it Llama 2. They could call it something like "Gemini", or if it's open weights, "Gemma".
How good is Trillium/TPU compared to Nvidia? The stats seem to be: TPU v6e achieves ~900 TFLOPS per chip (fp16), while an Nvidia H100 achieves ~1800 TFLOPS per GPU (fp16)?
Would be neat if anyone has benchmarks!!
The 1800 on the H100s is with 2:4 sparsity; it's half of that without. Not sure if the TPU number assumes that too, but I don't think 2:4 sparsity is used that heavily, so I'd probably compare without it.
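Back-of-the-envelope with the numbers quoted in this thread (spec-sheet peaks only; real-world utilization, memory bandwidth, and interconnect matter at least as much):

```python
# Treat these as the approximate figures quoted above, not authoritative specs.
h100_fp16_sparse = 1800                  # TFLOPS with 2:4 structured sparsity
h100_fp16_dense = h100_fp16_sparse / 2   # ~900 TFLOPS dense
tpu_v6e_bf16 = 900                       # TFLOPS per chip, as quoted above

print(tpu_v6e_bf16 / h100_fp16_dense)    # ~1.0: roughly at parity, dense vs. dense
```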
So Google has Trillium, Amazon has Trainium, Apple is working on a custom chip with Broadcom, etc. Nvidia’s moat doesn’t seem that big.
Plus big tech companies have the data and customers and will probably be the only surviving big AI training companies. I doubt startups can survive this game - they can’t afford the chips, can’t build their own, don’t have existing products to leech data off of, and don’t have control over distribution channels like OS or app stores
It seems this way, but we've been saying this for years and years. And somehow nvidia keeps making more and more.
Isn't it telling that Google's release of an "AI" chip doesn't include a single reference to Nvidia or its products? They're releasing it for general availability, for people to build solutions on, so it's pretty weird that there aren't comparisons to H100s et al. All of their comparisons are to their own prior generations, which is what you do if you're the leader (e.g. Apple does it with their chips), but it's a notable gap when you're a contender.
Google posted TPUv6 results for a few things on MLCommons in August. You can compare them to H100 over there, at least for inference on stable diffusion xl.
Suspiciously there is a filter for "TPU-trillium" in the training results table, but no result using such an accelerator. Maybe results were there and later redacted, or have been embargoed.
The biggest barrier for any Nvidia competitor is that hackers can run the models on their desktops. You don't need a cloud-provider-specific model to do stuff locally.
This. I suspect consumer brands focusing on consumer hardware are going to make a bigger dent in this space than cloud vendors.
The future of AI is local, not remote.
> Nvidia’s moat doesn’t seem that big.
Well, look at it this way. Nvidia played their cards so well that their competitors had to invent entirely new product categories to supplant their demand for Nvidia hardware. This new hardware isn't even reprising the role of CUDA, just the subset of tensor operations that are used for training and AI inference. If demand for training and inference wanes, these hardware investments will be almost entirely wasted.
Nvidia's core competencies (scaling hardware up and down, providing good software interfaces, and selling direct to consumers) are not really assailed at all. The big lesson Nvidia is giving to the industry is that you should invest in complex GPU architectures and write the software to support them. Currently the industry is trying its hardest to reject that philosophy, and only time will tell if it's correct.
> CUDA, just the subset of tensor operations that are used for training and AI inference. If demand for training and inference wanes
Interesting take, but why would demand for training and inference wane? This seems like a very contrarian take.
Maybe it won't - I say "time will tell" because we really do not know how much LLMs will be demanded in 10 years. Nvidia's stock skyrocketed because they were incidentally prepared for an enormous increase in demand the moment it happened. Now that expectations are cooling down and Sam Altman is signalling that AGI is a long ways off, the math that justified designing NPU/TPU hardware in-house might not add up anymore. Even if you believe in the tech itself, the hype is cooling and the do-or-die moment is rapidly approaching.
My overall point is that I think Nvidia played smartly from the start. They could derive profit from any sufficiently large niche their competitors were too afraid to exploit, and general purpose GPU compute was the perfect investment. With AMD, Apple and the rest of the industry focusing on simpler GPUs, Nvidia was given an empty soapbox to market CUDA with. The big question is whether demand for CUDA can be supplanted with application-specific accelerators.
> The big question is whether demand for CUDA can be supplanted with application-specific accelerators.
At least for AI workloads, Google's XLA compiler and the JAX ML framework have reduced the need for something like CUDA.
There are two main ways to train ML models today:
1) Kernel-heavy approach: This is where frameworks like PyTorch are used, and developers write custom kernels (using Triton or CUDA) to speed up certain ops.
2) Compiler-heavy approach: This uses tools like XLA, which apply techniques like op fusion and compiler optimizations to automatically generate fast, low-level code.
NVIDIA's CUDA is a major strength in the first approach. But if the second approach gains more traction, NVIDIA’s advantage might not be as important.
And I think the second approach has a strong chance of succeeding, given that two massive companies—Google (TPUs) and Amazon (Trainium)—are heavily investing in it.
(PS: I'm also a bit biased towards approach 2; we built Llama 3 fine-tuning on TPUs: https://github.com/felafax/felafax)
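To make the "compiler-heavy" approach concrete, here's a tiny JAX sketch (names and shapes are just illustrative): you write plain array ops, and jax.jit hands the whole function to XLA, which compiles it for the backend and fuses the bias add and activation around the matmul, with no hand-written CUDA or Triton kernels. The same code runs on TPU or GPU backends.

```python
# Compiler-heavy approach in miniature: no custom kernels, XLA does the fusion.
import jax
import jax.numpy as jnp

def mlp_layer(x, w, b):
    return jax.nn.gelu(x @ w + b)   # matmul + bias + activation, written naively

mlp_layer_jit = jax.jit(mlp_layer)  # traced once, compiled by XLA for the backend

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 512))
w = jax.random.normal(key, (512, 512))
b = jnp.zeros((512,))

out = mlp_layer_jit(x, w, b)        # runs as an XLA-compiled executable
print(out.shape)                    # (128, 512)

# To peek at what XLA receives:
# print(jax.jit(mlp_layer).lower(x, w, b).as_text())
```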
It's weird to me that folks think NVDA is just sitting there, waiting for everyone to take their lunch. Yes, I'm totally sure NVDA is completely blind to competition and has chosen to sit on their cash rather than develop alternatives...</s>
> If demand for training and inference wanes, these hardware investments will be almost entirely wasted
Nvidia also needs to invent something then, as pumping up mining (or giving goods to gamers) again is not sexy. What's next? Will we finally compute for drug development and achieve just as great results as with chatbots?
> Nvidia also needs to invent something then, as pumping up mining (or giving goods to gamers) again is not sexy.
They do! Their research page is well worth checking out; they wrote a lot of the fundamental papers that people cite for machine learning today: https://research.nvidia.com/publications
> Will we finally compute for drug development and achieve just as great results as with chatbots?
Maybe - but they're not really analogous problem spaces. Fooling humans with text is easy - Markov chains have been doing it for decades. Automating the discovery of drugs and research of proteins is not quite so easy, rudimentary attempts like Folding@Home went on for years without any breakthrough discoveries. It's going to take a lot more research before we get to ChatGPT levels of success. But tools like CUDA certainly help with this by providing flexible compute that's easy to scale.
There was nothing rudimentary about Folding@Home (either in the MD engine or the MSM clustering method), and my paper on GPCRs that used Folding@Home regularly gets cites from pharma (we helped establish the idea that treating proteins as being a single structure at the global energy minimum was too simplistic to design drugs). But F@H was never really a serious attempt at drug discovery- it was intended to probe the underlying physics of protein folding, which is tangentially related.
In drug discovery, we'd love to be able to show that virtual screening really worked: if you could do docking against a protein to find good leads affordably, and also ensure that the resulting leads were likely to pass FDA review (i.e., effective and non-toxic), that could greatly increase the rate of discovery.
Are the Gemini models open?
Just the Gemma models are open.
Not even to Google customers most days, it seems.
"we constantly strive to enhance the performance and efficiency of our Mamba and Jamba language models."
... "The growing importance of multi-step reasoning at inference time necessitates accelerators that can efficiently handle the increased computational demands."
Unlike others, my main concern with AI is that any savings we got from converting petroleum generating plants to wind/solar were blasted away by AI power consumption months or even years ago. Maybe Microsoft is on to something with the TMI revival.
Energy availability at this point appears (to me at least) to be practically infinite. In the sense that it is technically finite but not for any definition of the word that is meaningful for Earth or humans at this stage.
I don't see our current energy production scaling to meet the demands of AI. I see a lot of signals that most AI players feel the same. From where I'm sitting, AI is already accelerating energy generation to meet demand.
If your goal is to convert the planet to clean energy, AI seems like one of the most effective engines for doing that. It's going to drive the development of new technologies (like small modular nuclear reactors), pushing down the cost of construction and ownership. I strongly suspect that, in 50 years, the new energy tech that AI drives the development of will have rendered most of our current energy infrastructure worthless.
We will abandon current forms of energy production not because they were "bad" but because they were rendered inefficient.
This has been a constant thought for me as well. Like, the plan, from what I can tell, is that we are going to start spinning all this stuff up every single time someone searches something on Google, or perhaps when someone would otherwise search something on there.
Just that alone feels like an absolutely massive load to bear! But it's only a drop in the bucket compared to everything else around this stuff.
But while I may be thirsty and hungry in the future, at least I will (maybe) be able to know how many rs are in "strawberry".
Can we please pop this insane Nvidia valuation bubble now?