Bugs in LLM Training – Gradient Accumulation Fix

(unsloth.ai)

53 points | by apsec112 2 days ago

7 comments

  • imjonse 7 hours ago

    Same issue described on HF: https://huggingface.co/blog/gradient_accumulation

    It also highlights the main disadvantage of the Transformers codebase using the copy-paste method for models: this fix needs to be applied to every single model separately.

    • CraigJPerry an hour ago

      >> disadvantage of Transformers codebase using the copy-paste method for models, where this fix needs to be applied to every single model separately

      What are the best tools we have available for tackling this kind of large-scale copy-paste change?

      https://github.com/huggingface/transformers/pull/34191/commi...

      This feels too complex to tackle with PyCharm's structural find and replace; even a more powerful structural find-and-replace tool like https://comby.dev/ feels underpowered here.

      Sourcegraph batch changes? That solves broadcasting the change but doesn’t help with capturing the change to make.

      OpenRewrite? The Python implementation is early-stage and not prod-ready, as I understand it. Plus this change is too complex for Refaster templates even if we could use OpenRewrite, so you'd be debugging a fairly involved method visitor, which in this case is probably orders of magnitude more time-consuming than just making the changes manually.
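
      LibCST codemods are maybe the closest thing I've found: you capture the change once as a transformer and broadcast it over every modeling file. A minimal sketch of the shape (the num_items_in_batch kwarg and the toy source below are just my illustration, not the actual PR):

        import libcst as cst

        # Toy stand-in for one Transformers modeling file.
        SOURCE = '''
        class LlamaForCausalLM:
            def forward(self, input_ids, labels=None):
                pass
        '''

        class AddKwargTransformer(cst.CSTTransformer):
            # Hypothetical codemod: append a num_items_in_batch=None
            # parameter to every method named `forward`.
            def leave_FunctionDef(self, original_node, updated_node):
                if updated_node.name.value != "forward":
                    return updated_node
                new_param = cst.Param(
                    name=cst.Name("num_items_in_batch"),
                    default=cst.Name("None"),
                )
                return updated_node.with_changes(
                    params=updated_node.params.with_changes(
                        params=[*updated_node.params.params, new_param]
                    )
                )

        module = cst.parse_module(SOURCE)
        print(module.visit(AddKwargTransformer()).code)

      That solves the broadcasting half; capturing the matching logic for real modeling files would still be the hard part.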

      What else is there that I don’t know about?

  • xcodevn 4 hours ago

    Looked at from a different point of view, this is a feature, not a bug: without the fix, every example has equal weight; with the fix, every token has equal weight.

    • oergiR 3 hours ago

      That makes it sound like it’s a choice, which it isn’t really. The way to look at it is from a probabilistic perspective: with the fix, you maximise the probability of the data. Without the fix, you fairly arbitrarily raise some probabilities to a power greater than one, and some to a power less than one.
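
      In symbols (a sketch, assuming one sequence per micro-batch, with L_i the length of sequence i and N the number of micro-batches):

        % Full-batch training maximizes the average per-token log-likelihood:
        \mathcal{L}_{\mathrm{full}} = \frac{1}{\sum_i L_i} \sum_i \sum_{t=1}^{L_i} \log p(x_{i,t})
        % Naive grad accum takes a per-sequence mean, then a mean over steps:
        \mathcal{L}_{\mathrm{naive}} = \frac{1}{N} \sum_i \frac{1}{L_i} \sum_{t=1}^{L_i} \log p(x_{i,t})
        % i.e. it maximizes \prod_i p(x_i)^{1/(N L_i)}: sequences shorter
        % than average are raised to a power above the uniform 1/\sum_j L_j,
        % longer ones to a power below it.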

    • danielhanchen 4 hours ago

      Yes, you're correct, but in normal full-batch training without gradient accumulation, all tokens are weighted equally. Standard grad accum does not weight them equally, so the "fix" finally makes grad accum and full-batch training mathematically equivalent.

  • danielhanchen 7 hours ago

    Oh hey! :) TL;DR: naive gradient accumulation was over-weighting short sequences and under-weighting long sequences in LLM finetuning and training runs.

    E.g. for two texts with sequence lengths [1, 100]: full-batch training scales every token by 1/(100+1), but grad accum of 2 weights the length-1 text's token at 1/1 * 1/2 = 1/2 and each token of the length-100 text at 1/100 * 1/2 = 1/200. (The 1/2 is because grad accum divides by the number of grad accum steps.)
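
    A minimal sketch in plain PyTorch (not our actual trainer code) that reproduces that arithmetic, with fake per-token losses for two sequences of lengths 1 and 100:

      import torch

      torch.manual_seed(0)
      # Fake per-token losses for two sequences of lengths 1 and 100.
      per_token_losses = [torch.randn(1).abs(), torch.randn(100).abs()]

      # Full batch: one mean over all 101 tokens, so every token gets
      # weight 1/101.
      full_batch = torch.cat(per_token_losses).mean()

      # Naive grad accum: per-micro-batch mean, then divide by the number
      # of accumulation steps. The 1-token sequence's token is weighted
      # 1/2; each token of the 100-token sequence is weighted 1/200.
      steps = len(per_token_losses)
      naive = sum(l.mean() for l in per_token_losses) / steps

      # The fix: sum token losses across micro-batches and divide once by
      # the total token count, matching full-batch training.
      total_tokens = sum(l.numel() for l in per_token_losses)
      fixed = sum(l.sum() for l in per_token_losses) / total_tokens

      print(full_batch.item(), naive.item(), fixed.item())
      # full_batch and fixed agree (up to float rounding); naive differs.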

    • ejddhbrbrrnrn a minute ago

      Is this a general issue rather than Unsloth-specific? How widespread is this problem? It sounds wild if it has been affecting everyone's training.