Making Deep Learning Go Brrrr from First Principles

(horace.io)

25 points | by tosh an hour ago

11 comments

  • tosh an hour ago

    > in the time that Python can perform a single FLOP, an A100 could have chewed through 9.75 million FLOPS

    wild

    • rvz a few seconds ago

      So this is a prelude to what AI is going to do to people's minds, even to "experts", and it is a new low when one compares a language to a GPU.

      From this author to Dawkins, AI psychosis is as real as ever, and this year marks the top of this AI cycle.

    • patmorgan23 12 minutes ago

      Why are we comparing a programming language and a GPU? This is a category error. Programming languages do not perform any operations. They perform no FLOPs; they are the thing the FLOPs are performing.

      "The i7-4770K can perform 20k more FLOPs than C++" is an equally sensible statement (i.e. not).

    • p1esk 31 minutes ago

      This statement makes zero sense

    • xyzsparetimexyz an hour ago

      Single core vs multi core accounts for much of this

      • cdavid 24 minutes ago

        Not really. A GPU's many cores, at least for fp32, give you 2 to 4 orders of magnitude over a high-speed CPU.

        The rest comes from going from "Python float" (i.e. not NumPy) to C, which already gives you 2 to 3 orders of magnitude, and then another 2 to 3 from plain C to optimized SIMD.

        See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 orders of magnitude just from C.
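
For anyone who wants to check the arithmetic behind the quoted ratio, here is a rough sketch. The two inputs are assumptions chosen because they reproduce the quoted 9.75 million: 312 TFLOPS (the A100's published dense FP16/BF16 tensor-core peak) and roughly 32 million Python additions per second. The helper names `flop_ratio` and `measure_python_add_rate` are made up for illustration, and the measured rate will vary a lot by machine and interpreter version.

```python
import time

# Assumed inputs (not measured here): A100 dense FP16/BF16 tensor-core peak,
# and a ballpark figure for interpreted Python float additions per second.
A100_PEAK_FLOPS = 312e12
ASSUMED_PYTHON_ADDS_PER_SEC = 32e6


def flop_ratio(accelerator_flops, interpreter_flops):
    """How many accelerator FLOPs fit in the time of one interpreter FLOP."""
    return accelerator_flops / interpreter_flops


def measure_python_add_rate(n=1_000_000):
    """Time n float additions in a plain Python loop; returns additions/second.

    This includes loop overhead, so it is a rough lower bound, not a benchmark.
    """
    x = 0.0
    start = time.perf_counter()
    for _ in range(n):
        x += 1.0
    elapsed = time.perf_counter() - start
    return n / elapsed


if __name__ == "__main__":
    print("ratio with assumed inputs:",
          flop_ratio(A100_PEAK_FLOPS, ASSUMED_PYTHON_ADDS_PER_SEC))
    local_rate = measure_python_add_rate()
    print("local Python additions/sec:", local_rate)
    print("ratio with local rate:", flop_ratio(A100_PEAK_FLOPS, local_rate))
```

The point of the ratio is interpreter overhead per operation versus accelerator peak throughput; it is not a claim that the language itself executes FLOPs.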

  • jdw64 38 minutes ago

    Right now, all I know how to do is pull models from Hugging Face, but someday I want to build my own small LLM from scratch

    • glouwbug 3 minutes ago

      It’s just linear algebra. Work your way from feed forward to CNN to RNN to LSTM to attention, then maybe a small inference engine. Karpathy’s llama2.c is only ~300 lines for the latter, and it uses SIMD pragmas, so you don’t need fancy GPUs.
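
To make the "just linear algebra" point concrete, here is a minimal sketch of the first step on that list, a feed-forward pass, in dependency-free Python. The function names (`matvec`, `dense`, `relu`, `feed_forward`) and the choice of ReLU between layers are illustrative assumptions, not taken from llama2.c or any particular library.

```python
def matvec(W, x):
    """Matrix-vector product; W is a list of rows, x is a list of floats."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dense(W, b, x):
    """Affine layer: W @ x + b."""
    return [a + bi for a, bi in zip(matvec(W, x), b)]


def relu(v):
    """Elementwise max(0, v)."""
    return [max(0.0, a) for a in v]


def feed_forward(x, layers):
    """Apply each (W, b) layer in order, with ReLU between layers.

    No activation is applied after the final layer.
    """
    for i, (W, b) in enumerate(layers):
        x = dense(W, b, x)
        if i < len(layers) - 1:
            x = relu(x)
    return x


if __name__ == "__main__":
    # Tiny hand-picked weights: 2 inputs -> 2 hidden units -> 1 output.
    layers = [
        ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0]),
        ([[2.0, 1.0]], [-1.0]),
    ]
    print(feed_forward([3.0, 1.0], layers))
```

Everything after this (CNN, RNN, attention) reuses the same pattern of multiply, add, and a nonlinearity, just with different ways of wiring the weights.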

  • noosphr an hour ago

    >For example, getting good performance on a dataset with deep learning also involves a lot of guesswork. But, if your training loss is way lower than your test loss, you're in the "overfitting" regime, and you're wasting your time if you try to increase the capacity of your model.

    https://arxiv.org/abs/1912.02292

    • appplication an hour ago

      Generally, posting a link-only reply without further elaboration comes across as a bit rude. Are you providing support for the above point? Refuting it? You felt compelled to comment; a few words to indicate what you’re actually trying to say would go a long way.

      • noosphr 41 minutes ago

        >We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better.