Author here.
Thank you for all the kind and curious comments.
For 72B models, around *36GB of memory* works fine, by the way. I ran the benchmark and shared the results on the website: https://opengraviton.github.io/index.html
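For anyone checking the arithmetic, here is a rough sketch of how weight-only memory scales with parameter count and bit width. This is an illustrative estimate only: real usage adds KV cache, activations, and runtime overhead, and the bit widths shown are examples, not necessarily what OpenGraviton uses.

```python
# Rough weight-only memory estimate; excludes KV cache, activations, and overhead.
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9

print(weight_memory_gb(72e9, 4))     # 4-bit quantization: ~36 GB
print(weight_memory_gb(72e9, 1.58))  # ternary (1.58-bit): ~14 GB
```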
While working on this research I realized something important: the way most current models are trained is extremely inefficient. Because of that, I started developing *graviton-native*, which trains AI models from scratch using more efficient architectures.
The idea is to design models that are optimized for efficiency from the beginning. My expectation is that this approach could bring roughly a *70% efficiency improvement*. Combined with OpenGraviton, I believe this could eventually make it possible to run *trillion-parameter scale models locally*.
You can find the paper here: https://opengraviton.github.io/paper.html
And the repository here: https://github.com/opengraviton/graviton-native
Right now I’m training a *72B model* using this approach. I’ll share the results soon and update the website.
The mmap layer streaming approach is smart for working around memory limits. In practice though, 1.58-bit ternary quantization tends to degrade quality noticeably on reasoning-heavy tasks compared to 4-bit — curious if you've measured perplexity deltas at the 140B scale.
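For concreteness, the absmean ternarization from the BitNet b1.58 line of work looks roughly like this; whether OpenGraviton uses this exact scheme is an assumption on my part, and this is a sketch rather than a production kernel.

```python
import numpy as np

def ternarize(W: np.ndarray):
    """BitNet b1.58-style absmean quantization: map weights to {-1, 0, +1} * scale."""
    scale = np.abs(W).mean()                            # per-tensor absmean scale
    Wq = np.clip(np.round(W / (scale + 1e-8)), -1, 1)   # snap to nearest of {-1, 0, 1}
    return Wq.astype(np.int8), scale                    # dequantize later as Wq * scale

W = np.random.randn(256, 256).astype(np.float32)
Wq, scale = ternarize(W)
```

The quality question above is about how much information this per-tensor snap-to-three-values step throws away relative to a 16-level 4-bit grid.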
Interesting approach. The mmap streaming idea is clever, but I'd love to see real-world benchmarks beyond TinyLlama — especially for the 140B claim. Running that on a Mac Mini with 16GB would be the real proof point.
For context, I run a Mac Mini M4 as a homelab server and the memory pressure from even 7B models is noticeable. Curious how this handles sustained inference without thermal throttling.
Running a Mac Mini M4 as a home server for a bunch of automation scripts right now. The mmap-based layer streaming is the part I'm most curious about -- how does latency look when you're streaming layers from disk mid-inference? I'd expect throughput to degrade sharply once you exceed unified memory, but maybe the Top-K sparsity masks enough of the weight accesses that it's not as bad as sequential streaming would be. What's the actual tokens/sec at 140B scale on the base Mac Mini config?
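For readers unfamiliar with the technique being asked about, here is a minimal sketch of mmap-backed layer streaming using `np.memmap`. The file name, layout, and shapes are made up for illustration; the repo's real on-disk format will differ.

```python
import numpy as np

N_LAYERS, D = 4, 64
# Create a dummy weights file (in the real system this would be the exported model).
np.zeros((N_LAYERS, D, D), dtype=np.float16).tofile("weights.bin")

# mmap the file: layer weights are paged in from disk on first access,
# so resident memory stays bounded by what the OS page cache keeps around.
weights = np.memmap("weights.bin", dtype=np.float16, mode="r",
                    shape=(N_LAYERS, D, D))

def forward(x: np.ndarray) -> np.ndarray:
    for i in range(N_LAYERS):
        x = x @ weights[i]   # touching weights[i] faults its pages in on demand
    return x

out = forward(np.ones(D, dtype=np.float16))
```

The latency question then comes down to page-fault cost per layer versus compute per layer, and how much of the weight file the page cache can keep hot between tokens.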
Yeah...
https://github.com/opengraviton/graviton?tab=readme-ov-file#...
The benchmarks don't show any results for running these larger-than-memory models, only the size difference.
It all smells quite sloppy.
What I could find in the readme shows:
~19 tok/s for Apple M1 Max (64 GB) with TinyLlama-1.1B-Chat-v1.0
This is impressive. I've been experimenting with Gemini API for a side project and the latency difference between local and cloud inference is something I keep thinking about. How does memory usage scale with the 500B models?
Hi @fatihturker – exciting project if it works!
I have a MacBook Pro M1 Max w/64 GB RAM, and a Mac Studio M3 Ultra w/96 GB RAM. What do you think is possible to run on these? Just curious before I really try it out.
Fascinating. I don't understand the technical terms, but running a big coding agent locally is a dream of mine, so I thank you for your efforts!