3 comments

  • leopoldj 2 days ago

    There's more than one way to do self-supervised training.

    This is the approach the author has taken.

        Training corpus: "The fat cat sat on the mat"
    
        Input -> Label
        --------------
        "The" -> " fat"
        "The fat" -> " cat"
        "The fat cat" -> " sat"
    
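    As a sketch, the author's approach amounts to building one (prefix, next-token) pair per position, something like this (the token strings here are illustrative, not the author's actual code):

        tokens = ["The", " fat", " cat", " sat", " on", " the", " mat"]

        # One training pair per position: the prefix predicts the next token
        pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

        for prefix, label in pairs:
            print(repr("".join(prefix)), "->", repr(label))
        # 'The' -> ' fat'
        # 'The fat' -> ' cat'
        # ...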
    
    Hugging Face's Trainer class takes a different approach. The label is the same as the input, shifted left by 1 and padded with the <ignore> token (-100).

        Training corpus: "The fat cat sat on the mat"
        Input (7 tokens): "The fat cat sat on the mat"
        Predicted tokens (argmax of the 7 output logits): "mat fat sat on fat mat and"
        Shifted label (7 tokens): "fat cat sat on the mat <ignore>"
    
    
    Cross entropy is then calculated between the output logits and the shifted labels, with the <ignore> positions excluded from the loss. At least this is my understanding after reviewing the code.
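
    A rough PyTorch sketch of that loss step (the function name and shapes are mine, not the actual Trainer internals):

        import torch.nn.functional as F

        IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross_entropy

        def causal_lm_loss(logits, labels):
            # logits: (batch, seq_len, vocab) from a single forward pass
            # labels: the input ids shifted left by 1, padded with IGNORE_INDEX
            return F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),  # (batch*seq_len, vocab)
                labels.reshape(-1),                   # (batch*seq_len,)
                ignore_index=IGNORE_INDEX,            # padded slot adds no loss
            )
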
    • blackbear_ 2 days ago

      The two ways are equivalent (it's always next-token prediction), but the latter is way more efficient as it computes the loss for all N tokens in a single forward pass instead of one pass per prefix.
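
      To make that concrete, a toy sketch (the one-layer model here is a hypothetical stand-in for a causal LM):

          import torch
          import torch.nn.functional as F

          torch.manual_seed(0)
          vocab, dim, seq_len = 10, 8, 7
          emb = torch.nn.Embedding(vocab, dim)
          head = torch.nn.Linear(dim, vocab)
          input_ids = torch.randint(vocab, (1, seq_len))

          def model(ids):
              # Stand-in causal LM: per-position logits (batch, len, vocab)
              return head(emb(ids))

          # First approach: one forward pass per prefix, N-1 passes in total
          per_prefix = [
              F.cross_entropy(model(input_ids[:, :i])[:, -1], input_ids[:, i])
              for i in range(1, seq_len)
          ]

          # Second approach: a single forward pass over the whole sequence;
          # labels are shifted left and padded with -100, which cross_entropy
          # ignores, so all N-1 losses come out of one pass
          logits = model(input_ids)
          pad = torch.full((1, 1), -100, dtype=torch.long)
          labels = torch.cat([input_ids[:, 1:], pad], dim=1)
          single = F.cross_entropy(
              logits.reshape(-1, vocab), labels.reshape(-1), ignore_index=-100
          )

          assert torch.isclose(single, torch.stack(per_prefix).mean())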

  • asimovDev 2 days ago