10 comments

  • zxexz an hour ago

    I've seen some very impressive results just embedding a pre-trained KGE model into a transformer model and letting it "learn" to query it (I've just used heterogeneous loss functions during training, with "classifier dimensions" that determine whether to greedily sample from the KGE sidecar - I'm sure there are much better ways of doing this; rough sketch at the end of this comment). This is just a subjective viewpoint obviously, but I've played around quite a lot with this idea, and it's very easy to get an "interactive" small LLM with stable results doing such a thing. The only problem I've found is _updating_ the knowledge cheaply without partially retraining the LLM itself. For small, domain-specific models this isn't really an issue though - for personal projects I just use a couple of 3090s.

    I think this stuff will become a lot more fascinating after transformers have bottomed out on their hype curve and become just another tool for building specific types of models.
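
    To make that gating idea a bit more concrete, here's roughly the shape of it as an untested PyTorch sketch - the module and variable names are made up, and the real wiring depends heavily on the base model:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class GatedKGEHead(nn.Module):
            """LM head plus a frozen KGE sidecar and a gate ("classifier
            dimension") deciding whether to read from the sidecar or
            sample from the LM logits."""
            def __init__(self, hidden_dim, vocab_size, kge_embeddings):
                super().__init__()
                self.lm_head = nn.Linear(hidden_dim, vocab_size)
                # Pre-trained KGE entity table, kept frozen / out of the LM's graph.
                self.register_buffer("kge", kge_embeddings)   # (num_entities, kge_dim)
                self.to_kge = nn.Sequential(                  # small learnable bridge
                    nn.Linear(hidden_dim, kge_embeddings.size(1)), nn.ReLU())
                self.gate = nn.Linear(hidden_dim, 1)

            def forward(self, h):                             # h: (B, T, hidden_dim)
                lm_logits = self.lm_head(h)                   # (B, T, vocab_size)
                kge_scores = self.to_kge(h) @ self.kge.T      # (B, T, num_entities)
                gate_logit = self.gate(h)                     # (B, T, 1)
                return lm_logits, kge_scores, gate_logit

        def heterogeneous_loss(lm_logits, kge_scores, gate_logit,
                               tok_tgt, ent_tgt, use_kge):
            # use_kge is 1.0 at positions whose target is backed by a KG entity.
            loss_lm = F.cross_entropy(lm_logits.transpose(1, 2), tok_tgt, reduction="none")
            loss_kge = F.cross_entropy(kge_scores.transpose(1, 2), ent_tgt, reduction="none")
            loss_gate = F.binary_cross_entropy_with_logits(
                gate_logit.squeeze(-1), use_kge)
            return (loss_lm * (1 - use_kge) + loss_kge * use_kge).mean() + loss_gate

    At inference time the gate logit is what decides, per position, whether to greedily take the argmax over kge_scores or over lm_logits.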

    • aix1 an hour ago

      > embedding a pre-trained KGE model into a transformer model

      Do you have any good pointers (literature, code, etc.) on the mechanics of this?

      • zxexz 39 minutes ago

        Check out PyKEEN [0] and go wild. I like to train a bunch of random models and "overfit" them to the extreme (in my mind, overfitting is the point for this task; you want dense, compressed knowledge). Resize the input and output embeddings of an existing pretrained (but small) LLM (resizing the input is only necessary if you're adding extra metadata on input, but make sure you untie the input/output weights). You can add a linear-layer extension to the transformer blocks, pass it up as some sort of residual, etc. - honestly, just find a way to shove it in: detach the KGE from the computation graph and add something learnable between it and wherever you're connecting it, like just a couple of linear layers and a ReLU. The output side is more important: you can have some indicator logit(s) to determine whether to "read" from the detached graph or sample the outputs of the LLM. Or just always do both and interpret it. Rough sketch below.

        (Like TinyLlama or smaller - or just use whatever Karpathy repo is most fun at the moment and train some GPT-2 equivalent.)

        [0] https://pykeen.readthedocs.io/en/stable/index.html
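
        Very roughly, the PyKEEN-plus-adapter wiring looks something like this (untested sketch, assuming a recent PyKEEN; the toy dataset, dimensions, and module names are placeholders):

            import torch
            import torch.nn as nn
            from pykeen.pipeline import pipeline

            # 1. Deliberately "overfit" a small KGE model - dense, compressed knowledge.
            #    "Nations" is just PyKEEN's toy dataset; swap in your own triples.
            kge_result = pipeline(dataset="Nations", model="TransE",
                                  training_kwargs=dict(num_epochs=2000))
            entity_table = kge_result.model.entity_representations[0]().detach()  # (E, d_kge)

            # 2. A tiny learnable bridge between the detached KGE table and a
            #    transformer block's hidden states, riding along as an extra residual.
            class KGEAdapter(nn.Module):
                def __init__(self, entity_table, hidden_dim):
                    super().__init__()
                    self.register_buffer("entity_table", entity_table)  # stays frozen
                    self.proj = nn.Sequential(
                        nn.Linear(entity_table.size(1), hidden_dim), nn.ReLU(),
                        nn.Linear(hidden_dim, hidden_dim))

                def forward(self, hidden, entity_ids):    # hidden: (B, T, hidden_dim)
                    ent = self.entity_table[entity_ids]   # lookup, no grad into the KGE
                    return hidden + self.proj(ent)        # residual-style injection

        The output-side indicator logit is the same trick as in my comment above: a small linear head on the final hidden state, trained to flag positions where the detached graph should be read instead of the LM's own distribution.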

        • zxexz 29 minutes ago

          Sorry if that was ridiculously vague. I don't know a ton about the state of the art, and I'm really not sure there is one - the papers just seem to get more terminology-dense, and the research mostly seems to end up developing new terminology. My grug-brained philosophy is to make models small enough that you can just shove things in and iterate quickly in Colab or a locally hosted notebook with access to a couple of 3090s, or even just modern Ryzen/EPYC cores. I like to "evaluate" the raw model using pyro-ppl to do MCMC or SVI on the raw logits on a known holdout dataset.
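
          As a toy example of what I mean by that last part (untested; one global temperature fit with SVI, with placeholder shapes standing in for real held-out logits and labels):

              import torch
              import pyro
              import pyro.distributions as dist
              from pyro.infer import SVI, Trace_ELBO
              from pyro.infer.autoguide import AutoNormal
              from pyro.optim import Adam

              def temp_model(logits, labels):
                  # One global temperature over held-out next-token logits; how far
                  # it drifts from 1.0 is a cheap read on calibration after surgery.
                  temp = pyro.sample("temp", dist.LogNormal(0.0, 1.0))
                  with pyro.plate("data", logits.size(0)):
                      pyro.sample("obs", dist.Categorical(logits=logits / temp), obs=labels)

              guide = AutoNormal(temp_model)
              svi = SVI(temp_model, guide, Adam({"lr": 1e-2}), loss=Trace_ELBO())
              logits = torch.randn(256, 50257)          # placeholder: held-out logits
              labels = torch.randint(0, 50257, (256,))  # placeholder: true next tokens
              for _ in range(500):
                  svi.step(logits, labels)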

          Always happy to chat about this stuff with anybody. Would love to explore ideas here - it's a fun hobby, and we're living in a golden age of open-source structured datasets. I haven't actually found a community interested specifically in static knowledge injection. Email in profile (ebg_13 encoded).

      • napsternxg 18 minutes ago

        We also did something similar in our NTULM paper at Twitter: https://youtu.be/BjAmQjs0sZk?si=PBQyEGBx1MSkeUpX

        We used it in non-generative language models like BERT, but it should help with generative models as well.

        • zxexz 11 minutes ago

          Thanks for sharing! I'll give it a read tomorrow - I don't appear to have come across it before. I really do wish there were good places for randos like me to discuss this stuff casually. I'm in so many Slack, Discord, etc. channels, but none of them have the same intensity and hyperfocus as certain IRC channels of yore.

  • isaacfrond an hour ago

    I think there is a philosophical angle to this. I mean, my world map was constructed by chance interactions with the real world. Does this mean that my world map is as close to the real world as their NN's map is to Manhattan? Is my world map full of non-existent streets, exits in the wrong place, etc.? The NN map of Manhattan works almost 100% correctly when used for normal navigation but breaks apart badly when it has to plan a detour. How brittle is my world map?

    • cen4 6 minutes ago

      Also, things are not static in the real world.

  • fragmede 2 hours ago

    Wrong as it is, I'm impressed they were able to get any maps out of their LLM that look even vaguely coherent. The shortest-path map has bits of streets downtown and around Central Park that aren't totally red, and Central Park itself is clear on all three maps.

    They used eight A100s but don't say how long it took to train their LLM; it would be interesting to know the wall-clock time they spent. Their dataset is, relatively speaking, tiny, which means it should take fewer resources to replicate from scratch.

    What's interesting, though, is that the smaller model performed better; they don't speculate as to why.