An attempt at a summary of the argument:
- Human brains are estimated to have a few hundred trillion synapses. If you tried to replicate this in a neural network model with one parameter per synapse, it would be much larger than the largest models in use today.
- Conventional wisdom in the form of the Chinchilla scaling law suggests that to train such a gargantuan model, you would need an even more gargantuan training corpus.
- But no human has read anywhere near as much as even relatively small Chinchilla-optimal models. In fact, rather than acquiring as much data as possible as efficiently as possible, children might rather rewatch the exact same video for the umpteenth time. When they learn arithmetic, it's from just a paltry few examples provided by the teacher in school.
- Large neural networks trained on so little training data would quickly memorize it perfectly and overfit horribly.
- Individuals with photographic memory demonstrate that human brains indeed have the memorization capacity you would expect based on synapse count, and appear to show difficulties with generalization as a side-effect.
- Speculatively, typical humans forget and generalize instead of memorizing because synaptic strengths are reduced during sleep in an analogue to regularization by weight decay.
- Therefore, maybe we should train extremely large models on little data with extremely strong weight decay to counteract memorization, and hope a large learning rate will quickly "catapult" them to a generalizing solution.
What I'm missing is a discussion of how much this would cost, even if you handle deployment by distillation into smaller, faster, less data-efficient models.
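For a rough sense of scale, here is a back-of-envelope sketch. It assumes the commonly quoted Chinchilla rule of thumb of about 20 training tokens per parameter and the usual approximation that training takes about 6 FLOPs per parameter per token. The synapse count of 2e14 and the 1e9-token "little data" corpus are illustrative values picked here, not figures from the article, and the function names are made up for this example.

```python
# Back-of-envelope only: every constant below is an assumption, not a measurement.
TOKENS_PER_PARAM = 20        # commonly quoted Chinchilla rule of thumb
FLOPS_PER_PARAM_TOKEN = 6    # common approximation: training FLOPs ~ 6 * N * D


def chinchilla_tokens(n_params: float) -> float:
    """Tokens a compute-optimal run would want for a model of n_params."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate FLOPs for one pass over n_tokens with n_params parameters."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


n_params = 2e14      # "a few hundred trillion" synapses, value picked for illustration
small_corpus = 1e9   # hypothetical "little data" corpus, in tokens

optimal_tokens = chinchilla_tokens(n_params)
print(f"Chinchilla-style corpus: {optimal_tokens:.1e} tokens")
print(f"FLOPs at that corpus:    {training_flops(n_params, optimal_tokens):.1e}")
print(f"FLOPs per epoch on the small corpus: {training_flops(n_params, small_corpus):.1e}")
```

Under these assumptions the small-corpus run is many orders of magnitude cheaper per epoch than the Chinchilla-style run, but the total depends on how many epochs the "catapult" actually needs, which this sketch does not model.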
> But no human has read anywhere near as much as even relatively small Chinchilla-optimal models
They're missing that humans don't consume raw text. They consume non-stop high resolution, high FPS audio and video imagery. If you tokenized the input to human eyes and ears in the first few years of life, that's more data than even the largest LLMs are trained on.
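One way to sanity-check this is to ask what sensory token rate would be needed to match an LLM corpus. The sketch below is only arithmetic: the 15-trillion-token corpus, the 4 years, and the 12 waking hours per day are all assumptions chosen for illustration, and it says nothing about how audio or video should actually be tokenized, which is the hard part.

```python
SECONDS_PER_HOUR = 3600
DAYS_PER_YEAR = 365


def breakeven_tokens_per_second(corpus_tokens: float,
                                years: float,
                                waking_hours_per_day: float) -> float:
    """Sensory tokens per waking second needed to match corpus_tokens in `years`."""
    waking_seconds = years * DAYS_PER_YEAR * waking_hours_per_day * SECONDS_PER_HOUR
    return corpus_tokens / waking_seconds


# All three inputs are illustrative assumptions, not measurements.
rate = breakeven_tokens_per_second(corpus_tokens=15e12, years=4, waking_hours_per_day=12)
print(f"break-even rate: {rate:,.0f} tokens per waking second")
```

With these inputs the break-even rate comes out at a few hundred thousand tokens per waking second. Whether eyes and ears deliver that much depends entirely on the tokenization you assume.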
> Human brains are estimated to have a few hundred trillion synapses. If you tried to replicate this in a neural network model with one parameter per synapse...
Note that LLM parameters don't map to synapses in the same naive way they would for a fully connected network. Each attention parameter is applied thousands or millions of times to the inputs at each inference pass, so it's more like each param might code for a neural circuit repeated thousands of times.
I think of attention as a sort of convolution: in a NN, each convolution kernel gets applied repeatedly to all parts of an image, but in the human visual cortex I imagine these circuits are effectively all separate and parallel. The few parameters of a convolution kernel map to thousands of identical circuits in the visual cortex.
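To put numbers on the weight-sharing point, here is a small sketch that counts parameters for a convolution versus a hypothetical "unshared" layer with a separate copy of the kernel at every position, which is the picture of the visual cortex suggested above. It assumes stride 1 and same-size output so the kernel is applied once per pixel, and it ignores biases. The 224x224 image and 3x3 kernel are arbitrary example sizes.

```python
def conv_weight_sharing(height: int, width: int, kernel_size: int,
                        in_ch: int = 1, out_ch: int = 1) -> tuple[int, int, int]:
    """Return (shared params, positions the kernel is applied at, unshared params)."""
    shared = kernel_size * kernel_size * in_ch * out_ch
    positions = height * width        # stride 1, same-size output (assumption)
    unshared = shared * positions     # one private kernel copy per position
    return shared, positions, unshared


shared, positions, unshared = conv_weight_sharing(224, 224, 3)
print(f"shared kernel params:     {shared}")
print(f"applications per image:   {positions}")
print(f"params if nothing shared: {unshared}")
```

The same counting applies loosely to attention: the projection weights are reused at every token position, so the number of applications grows with sequence length while the parameter count stays fixed.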
> Human brains do this by deep double descent-style overparameterization, and adopting a scaling strategy of extremely high-learning-rate training of extremely overparameterized models on small diverse highly-filtered datasets.
That’s an extremely steep claim with no source other than vibes. Last time I checked my biology notes, model parameters are neurons, and they cost a ton of energy to maintain. Your hypothesis is really far removed from any actual neuroscience. Also, where are those filtered datasets coming from? Do you think genetics hands them to us? There’s about zero evidence for this claim as well. I like new concepts for ML research but please do not make up theories of human cognition when you clearly have no idea about it.
We have a lot of synapses, but (agreeing with you) I don't find that sufficient to explain why humans (or animals!) do what we do. If you throw zillions of parameters at a problem with a weak architecture, you get really high-fidelity memorization, and we're not awesome at memorization compared to machines.
Humans can do an impressive amount of generalization from one error or surprise, and as is often rightly noted, don't need trillions of words to get going. And it all seems to happen in some 'forward-only' way, without backpropagation -- we don't have AdamW or MuonClip helpfully nudging our synaptic connections towards whatever would have scored well on our most recent test. It is relevant that we're creatures with goals -- reinforcement learning is the only stage where there's a taste of that for neural nets -- but the learning differences seem at least partly independent of that.
I suppose it could turn out that, even if not sufficient, the large number of synapses is necessary to all this, like we're effectively buying a lot of lottery tickets that give us a shot at fishing interesting hypotheses out of the experiences flowing by. But I'm still awfully suspicious that we don't have the right mathematical model for learning messy ideas all worked out yet.
There is actually a way to get really amazing sample efficiency out of a learning setup, and that's engineering in a load of appropriate inductive biases, which personally I am convinced evolution has done for us. That explains a big chunk of the "how are brains so sample efficient" problem really easily, but unfortunately without handing us an easy way to replicate it, which makes it unpopular. Also, it's something that we don't really want to do in the same way evolution has, as all those biases further reduce sample efficiency for everything they are not appropriate for.
> Speculative proposal
I guess at least they're honest about it? lol
Human brains were trained by evolution, a genetic algorithm that ran for billions of years and used the entire planet Earth as compute. Good luck competing with it using your puny corpus of texts.
No, I'm sorry, but there is no secret math formula that will allow you to overcome the lack of training data.