3 comments

  • leopoldj 2 days ago

    There's more than one way to do self-supervised training.

    This is the approach the author has taken.

        Training corpus: "The fat cat sat on the mat"
    
        Input -> Label
        --------------
        "The" -> " fat"
        "The fat" -> " cat"
        "The fat cat" -> " sat"
    
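    As a sketch, the author's approach amounts to building one (prefix, next-token) pair per position, something like this (the token strings here are illustrative, not the author's actual code):

        tokens = ["The", " fat", " cat", " sat", " on", " the", " mat"]

        # One training pair per position: the prefix predicts the next token
        pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

        for prefix, label in pairs:
            print(repr("".join(prefix)), "->", repr(label))
        # 'The' -> ' fat'
        # 'The fat' -> ' cat'
        # ...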
    
    Hugging Face's Trainer class takes a different approach. The label is the same as the input, shifted left by 1 and padded with the <ignore> token (-100).

        Training corpus: "The fat cat sat on the mat"
        Input (7 tokens): "The fat cat sat on the mat"
        Predicted tokens (argmax of the 7 output logits): "mat fat sat on fat mat and"
        Shifted label (7 tokens): "fat cat sat on the mat <ignore>"
    
    
    Cross entropy is then calculated between the output logits and the shifted labels, with the <ignore> positions excluded from the loss. At least this is my understanding after reviewing the code.
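
    A rough PyTorch sketch of that loss step (the function name and shapes are mine, not the actual Trainer internals):

        import torch.nn.functional as F

        IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross_entropy

        def causal_lm_loss(logits, labels):
            # logits: (batch, seq_len, vocab) from a single forward pass
            # labels: the input ids shifted left by 1, padded with IGNORE_INDEX
            return F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),  # (batch*seq_len, vocab)
                labels.reshape(-1),                   # (batch*seq_len,)
                ignore_index=IGNORE_INDEX,            # padded slot adds no loss
            )
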
    • blackbear_ 2 days ago

      The two ways are equivalent (it's always next-token prediction), but the latter is way more efficient as it computes the loss for all N tokens in a single forward pass instead of one pass per prefix.
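
      To make that concrete, a toy sketch (the one-layer model here is a hypothetical stand-in for a causal LM):

          import torch
          import torch.nn.functional as F

          torch.manual_seed(0)
          vocab, dim, seq_len = 10, 8, 7
          emb = torch.nn.Embedding(vocab, dim)
          head = torch.nn.Linear(dim, vocab)
          input_ids = torch.randint(vocab, (1, seq_len))

          def model(ids):
              # Stand-in causal LM: per-position logits (batch, len, vocab)
              return head(emb(ids))

          # First approach: one forward pass per prefix, N-1 passes in total
          per_prefix = [
              F.cross_entropy(model(input_ids[:, :i])[:, -1], input_ids[:, i])
              for i in range(1, seq_len)
          ]

          # Second approach: a single forward pass over the whole sequence;
          # labels are shifted left and padded with -100, which cross_entropy
          # ignores, so all N-1 losses come out of one pass
          logits = model(input_ids)
          pad = torch.full((1, 1), -100, dtype=torch.long)
          labels = torch.cat([input_ids[:, 1:], pad], dim=1)
          single = F.cross_entropy(
              logits.reshape(-1, vocab), labels.reshape(-1), ignore_index=-100
          )

          assert torch.isclose(single, torch.stack(per_prefix).mean())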

  • asimovDev 2 days ago