I've seen some very impressive results just embedding a pre-trained KGE model into a transformer model and letting it "learn" to query it (I've just used heterogeneous loss functions during training with "classifier dimensions" that determine whether to greedily sample from the KGE sidecar; I'm sure there are much better ways of doing this). This is just a subjective viewpoint obviously, but I've played around quite a lot with this idea, and it's very easy to get an "interactive" small LLM with stable results doing such a thing. The only problem I've found is _updating_ the knowledge cheaply without partially retraining the LLM itself. For small, domain-specific models this isn't really an issue though - for personal projects I just use a couple of 3090s.
I think this stuff will become a lot more fascinating after transformers have bottomed out on their hype curve and become a tool when building specific types of models.
> embedding a pre-trained KGE model into a transformer model
Do you have any good pointers (literature, code etc) on the mechanics of this?
Check out PyKEEN [0] and go wild. I like to train a bunch of random models and "overfit" them to the extreme (in my mind overfitting is the point for this task - you want dense, compressed knowledge). Resize the input and output embeddings of an existing pretrained (but small) LLM (resizing the input is only necessary if you're adding extra metadata on input, but make sure you untie the input/output weights). You can add a linear-layer extension to the transformer blocks, pass it up as some sort of residual, etc. - honestly, just find a way to shove it in: detach the KGE from the computation graph and add something learnable between it and wherever you're connecting it, like just a couple of linear layers and a ReLU (there's a rough sketch of this below the footnote). The output side is more important: you can have some indicator logit(s) to determine whether to "read" from the detached graph or sample the outputs of the LLM. Or just always do both and interpret it.
(like tinyllama or smaller, or just use whatever karpathy repo is most fun at the moment and train some gpt2 equivalent)
[0] https://pykeen.readthedocs.io/en/stable/index.html
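To make that concrete, here's a minimal sketch of the kind of thing I mean: train a KGE model with PyKEEN, pull out its (frozen) entity embeddings, and bolt a tiny learnable adapter plus an indicator gate onto an LLM hidden state. All the names (GatedKGEAdapter, the dimensions, the Nations toy dataset) are just illustrative, and the embedding accessor is the PyKEEN 1.x API - check it against whatever version you have installed:

    import torch
    import torch.nn as nn
    from pykeen.pipeline import pipeline

    # Deliberately "overfit" a KGE model on your triples (toy dataset here).
    result = pipeline(model="TransE", dataset="Nations",
                      training_kwargs=dict(num_epochs=2000))
    # All entity embeddings as a frozen tensor (PyKEEN 1.x; verify for your version).
    kge = result.model.entity_representations[0](indices=None).detach()

    class GatedKGEAdapter(nn.Module):
        """A couple of linear layers + ReLU between the detached KGE and the LLM,
        plus an indicator logit deciding whether to "read" from the graph."""
        def __init__(self, kge_dim, hidden_dim):
            super().__init__()
            self.proj = nn.Sequential(nn.Linear(kge_dim, hidden_dim), nn.ReLU(),
                                      nn.Linear(hidden_dim, hidden_dim))
            self.gate = nn.Linear(hidden_dim, 1)

        def forward(self, hidden, kge_vecs):
            injected = self.proj(kge_vecs)                # KGE already detached above
            gate = torch.sigmoid(self.gate(hidden))       # per-token read/don't-read
            return hidden + gate * injected, gate         # residual-style injection

    adapter = GatedKGEAdapter(kge_dim=kge.shape[1], hidden_dim=768)
    hidden = torch.randn(2, 16, 768)                      # stand-in LLM hidden states
    entity_ids = torch.randint(0, kge.shape[0], (2, 16))  # whatever your entity linker produces
    mixed, gate = adapter(hidden, kge[entity_ids])        # feed `mixed` back into the block/head

During training you can supervise the gate with those "classifier dimensions" via an extra loss term, or just let it fall out of the language-modelling objective.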
Sorry if that was ridiculously vague. I don't know a ton about the state of the art, and I'm really not sure there is one - the papers just seem to get more terminology-dense, and the research mostly seems to end up developing new terminology. My grug-brained philosophy is to make models small enough that you can just shove things in and iterate quickly in Colab or a locally hosted notebook with access to a couple of 3090s, or even just modern Ryzen/EPYC cores. I like to "evaluate" the raw model by using pyro-ppl to do MCMC or SVI on the raw logits over a known holdout dataset.
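For the pyro-ppl bit: one way to read "MCMC on the raw logits" is inferring a posterior over something like a temperature/calibration parameter from holdout logits and labels. This is just a hedged sketch of that reading - the names, priors, and stand-in data are mine, not a standard recipe:

    import torch
    import pyro
    import pyro.distributions as dist
    from pyro.infer import MCMC, NUTS

    def calibration_model(logits, labels):
        # logits: (N, vocab) raw outputs on a holdout set; labels: (N,) true next tokens
        temperature = pyro.sample("temperature", dist.LogNormal(0.0, 1.0))
        with pyro.plate("data", logits.shape[0]):
            pyro.sample("obs", dist.Categorical(logits=logits / temperature), obs=labels)

    logits = torch.randn(256, 1000)                  # stand-ins for real holdout logits/labels
    labels = torch.randint(0, 1000, (256,))
    mcmc = MCMC(NUTS(calibration_model), num_samples=300, warmup_steps=100)
    mcmc.run(logits, labels)
    print(mcmc.get_samples()["temperature"].mean())  # well above 1 suggests overconfidence

Swapping MCMC for SVI with an autoguide is the same shape of code if NUTS is too slow on a big holdout set.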
Really, always happy to chat about this stuff with anybody. Would love to explore ideas here - it's a fun hobby, and we're living in a golden age of open-source structured datasets. I haven't actually found a community interested specifically in static knowledge injection. Email in profile (ebg_13 encoded).
We also did something similar in our NTULM paper at Twitter https://youtu.be/BjAmQjs0sZk?si=PBQyEGBx1MSkeUpX
Used in non-generative language models like BERT, but it should help with generative models as well.
Thanks for sharing! I'll give it a read tomorrow - I do not appear to have read this. I really do wish there were good places for randos like me to discuss this stuff casually. I'm in so many Slack, Discord, etc. channels, but none of them have the same intensity and hyperfocus as certain IRC channels of yore.
I think there is a philosophical angle to this. I mean, my world map was constructed by chance interactions with the real world. Does this mean that my world map is as close to the real world as their NN's map is to Manhattan? Is my world map full of non-existent streets, exits in the wrong place, etc.? The NN map of Manhattan works almost 100% correctly when used for normal navigation but breaks apart badly when it has to plan a detour. How brittle is my world map?
Also things are not static in the real world.
Wrong as it is, I'm impressed they were able to get any maps out of their LLM that look vaguely cohesive. The shortest path map has bits of streets downtown and around Central Park that aren't totally red, and Central Park itself is clear on all 3 maps.
They used eight A100s, but don't say how long it took to train their LLM. It would be interesting to know the wall-clock time they spent. Their dataset is, relatively speaking, tiny, which means it should take fewer resources to replicate from scratch.
What's interesting, though, is that the smaller model performed better - and they don't speculate why that is.
I can't imagine training took more than a day on 8 A100s, even with that vocab size [0] (does Lightning do implicit vocab extension, maybe?) and a batch size of 1 [1] or 64 [2] or 4096 [3] (I haven't trawled through the repo and other work enough to see what they actually use in the paper - and let's be real, we've all copied random min/nano/whatever GPT forks and not bothered renaming stuff). They mention their dataset is 120 million tokens, which is minuscule by transformer standards. Even with a more graph-based model making it 10x+ longer to train, the equivalent of 1.2 billion tokens per epoch shouldn't take more than a couple of hours with no optimization (rough back-of-envelope after the links below).
[0] https://github.com/keyonvafa/world-model-evaluation/blob/949... [1] https://github.com/keyonvafa/world-model-evaluation/blob/949... [2] https://github.com/keyonvafa/world-model-evaluation/blob/949... [3] https://github.com/keyonvafa/world-model-evaluation/blob/mai...
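Rough arithmetic behind that "couple of hours" claim - the throughput figure is purely an assumption about a small GPT-2-class model on A100s, not a measurement from their repo:

    tokens_per_epoch = 120e6        # dataset size they report
    epochs = 10                     # ~1.2B tokens processed in total
    assumed_throughput = 8 * 2e4    # assume ~20k tokens/sec per A100 for a small model
    hours = tokens_per_epoch * epochs / assumed_throughput / 3600
    print(f"~{hours:.1f} hours")    # about 2 hours under these assumptions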