The deep learning boom caught almost everyone by surprise

(understandingai.org)

306 points | by slyall 2 months ago

191 comments

  • aithrowawaycomm 2 months ago

    I think there is a slight disconnect here between making AI systems which are smart and AI systems which are useful. It’s a very old fallacy in AI: pretending tools which assist human intelligence by solving human problems must themselves be intelligent.

    The utility of big datasets was indeed surprising, but that skepticism came about from recognizing the scaling paradigm must be a dead end: vertebrates across the board require less data to learn new things, by several orders of magnitude. Methods to give ANNs “common sense” are essentially identical to the old LISP expert systems: hard-wiring the answers to specific common-sense questions in either code or training data, even though fish and lizards can rapidly make common-sense deductions about manmade objects they couldn’t have possibly seen in their evolutionary histories. Even spiders have generalization abilities seemingly absent in transformers: they spin webs inside human homes with unnatural geometry.

    Again it is surprising that the ImageNet stuff worked as well as it did. Deep learning is undoubtedly a useful way to build applications, just like Lisp was. But I think we are about as close to AGI as we were in the 80s, since we have made zero progress on common sense: in the 80s we knew Big Data can poorly emulate common sense, and that’s where we’re at today.

    • j_bum 2 months ago

      > vertebrates across the board require less data to learn new things, by several orders of magnitude.

      Sometimes I wonder if it’s fair to say this.

      Organisms have had billions of years of training. We might come online and succeed in our environments with very little data, but we can’t ignore the information that’s been trained into our DNA, so to speak.

      What’s billions of years of sensory information that drove behavior and selection, if not training data?

      • aithrowawaycomm 2 months ago

        My primary concern is the generalization to manmade things that couldn’t possibly be in the evolutionary “training data.” As a thought experiment, it seems very plausible that you can train a transformer ANN on spiderwebs between trees, rocks, bushes, etc, and get “superspider” performance (say in a computer simulation). But I strongly doubt this will generalize to building webs between garages and pantries like actual spiders, no matter how many trees you throw at it, so such a system wouldn’t be ASI.

        This extends to all sorts of animal cognitive experiments: crows understand simple pulleys simply by inspecting them, but they couldn’t have evolved to use pulleys. Mice can quickly learn that hitting a button 5 times will give them a treat: does it make sense to say that they encountered a similar situation in their evolutionary past? It makes more sense to suppose that mice and crows have powerful abilities to reason causally about their actions. These abilities are more sophisticated than mere “Pavlovian” associative reasoning, which is about understanding stimuli. With AI we can emulate associative reasoning very well because we have a good mathematical framework for Pavlovian responses as a sort of learning of correlations. But causal reasoning is much more mysterious, and we are very far from figuring out a good mathematical formalism that a computer can make sense of.

        I also just detest the evolution = training data metaphor because it completely ignores architecture. Evolution is not just glomming on data, it’s trying different types of neurons, different connections between them, etc. All organisms alive today evolved with “billions of years of training,” but only architecture explains why we are so much smarter than chimps. In fact I think the “evolution” preys on our misconception that humans are “more evolved” than chimps, but our common ancestor was more primitive than a chimp.

        • visarga 2 months ago

          I don't think "humans/animals learn faster" holds. LLMs learn new things on the spot, you just explain it in the prompt and give an example or two.

          A recent paper tested both linguists and LLMs at learning a language with less than 200 speakers and therefore virtually no presence on the web. All from a few pages of explanations. The LLMs come close to humans.

          https://arxiv.org/abs/2309.16575

          Another example is the ARC-AGI benchmark, where the model has to learn from a few examples to derive the rule. AI models are closing the gap to human level, they are around 55% while humans are at 80%. These tests were specifically designed to be hard for models and easy for humans.

          Besides these examples of fast learning, I think the other argument about humans benefiting from evolution is also essential here. Similarly, we can't beat AlphaZero at Go, as it evolved its own Go culture and plays better than us. Evolution is powerful.

        • car 2 months ago

          It’s all in the architecture. Also, biological neurons are orders of magnitude more complex than NN’s. There’s a plethora of neurotransmitters and all kinds of cellular machinery for dealing with signals (inhibitory, excitatory etc.).

          • datameta 2 months ago

            Right - there is more inherent non-linearity in the fundamental unit of our architecture which leads to higher possible information complexity.

        • YeGoblynQueenne 2 months ago

          >> But causal reasoning is much more mysterious, and we are very far from figuring out a good mathematical formalism that a computer can make sense of.

          I agree with everything else you've said to a surprising degree (if I say the same things myself down the line I swear I'm not plagiarising you) but the above statement is not right: we absolutely know how to do deductive reasoning from data. We have powerful deductive inference approaches: search and reasoning algorithms, Resolution the major among them.

          What we don't have is a way to use those algorithms without a formal language or a structured object in which to denote the inputs and outputs. E.g. with Resolution you need logic formulae in clausal form, for search you need a graph etc. Animals don't need that and can reason from raw sensory data.

          Anyway we know how to do reasoning, not just learning; but the result of my doctoral research is that both are really one and what statistical machine learning is missing is a bridge between the two.
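
          To make the clausal-form requirement concrete, here is a minimal sketch of propositional Resolution by refutation. It assumes clauses are sets of string literals with a leading "-" for negation; the names `resolve` and `entails` and the rain/wet example are illustrative, not taken from any particular prover.

```python
from itertools import combinations

def negate(lit):
    # "-p" is the negation of "p" and vice versa.
    return lit[1:] if lit.startswith("-") else "-" + lit

def resolve(c1, c2):
    """Return every resolvent of two clauses (frozensets of literals)."""
    out = []
    for lit in c1:
        if negate(lit) in c2:
            out.append(frozenset((c1 - {lit}) | (c2 - {negate(lit)})))
    return out

def entails(clauses, goal):
    """Refutation: add the negated goal, saturate, look for the empty clause."""
    known = {frozenset(c) for c in clauses} | {frozenset([negate(goal)])}
    while True:
        new = set()
        for a, b in combinations(known, 2):
            for r in resolve(a, b):
                if not r:          # empty clause: contradiction found
                    return True
                new.add(r)
        if new <= known:           # nothing new can be derived
            return False
        known |= new

# "rain implies wet" in clausal form is (-rain OR wet); "rain" is a fact.
kb = [{"-rain", "wet"}, {"rain"}]
print(entails(kb, "wet"))
print(entails(kb, "dry"))
```

          Note that the input already has to arrive as clauses; nothing here says how to get from raw sensory data to `{"-rain", "wet"}`.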

        • myownpetard 2 months ago

          Evolution is the heuristic search for effective neural architectures. It is training data, but for the meta-search for effective architectures, which gets encoded in our DNA.

          Then we compile and run that source code and our individual lived experience is the training data for the instantiation of that architecture, e.g. our brain.

          It's two different but interrelated training/optimization processes.

      • outworlder 2 months ago

        Difficult to compare: not only are neurons vastly more complex, but the neural networks themselves change and adapt. It's as if GPUs were not only programmed by software, but the hardware could also change based on the training data (like more sophisticated FPGAs).

        Our DNA also stores a lot of information, but it is not that much.

        Our dogs can learn about things such as vehicles that they have not been exposed to nearly enough, evolution wide. And so do crows, using cars to crack nuts and then waiting for red lights. And that's completely unsupervised.

        We have a long way to go.

        • klipt 2 months ago

          You say "unsupervised" but crows are learning with feedback from the physical world.

          Young crows certainly learn: hitting objects is painful. Avoiding objects avoids the pain.

          From there, learning that red lights correlates with the large, fast, dangerous object stopping, is just a matter of observation.

          • RaftPeople 2 months ago

            > From there, learning that red lights correlates with the large, fast, dangerous object stopping, is just a matter of observation

            I think "just a matter of observation" understates the many levels of abstraction and generalization that animal brains have evolved to effectively deal with the environment.

            Here's something I just read the other day about this:

            Summary: https://medicalxpress.com/news/2024-11-neuroscientists-revea...

            Actual: https://www.nature.com/articles/s41586-024-08145-x

            "After experiencing enough sequences, the mice did something remarkable—they guessed a part of the sequence they had never experienced before. When reaching D in a new location for the first time, they knew to go straight back to A. This action couldn't have been remembered, since it was never experienced in the first place! Instead, it's evidence that mice know the general structure of the task and can track their 'position' in behavioral coordinates"

      • RaftPeople 2 months ago

        > Organisms have had billions of years of training. We might come online and succeed in our environments with very little data, but we can’t ignore the information that’s been trained into our DNA, so to speak

        It's not just information (e.g. sets of innate smells and response tendencies), but it's also all of the advanced functions built into our brains (e.g. making sense of different types of input, dynamically adapting the brain to conditions, etc.).

      • lubujackson 2 months ago

        Good point. And don't forget the dynamically changing environment responding with a quick death for any false path.

        Like how good would LLMs be if their training set was built by humans responding with an intelligent signal at every crossroads.

      • marcosdumay 2 months ago

        > but we can’t ignore the information that’s been trained into our DNA

        There's around 600MB in our DNA. Subtract this from the size of any LLM out there and see how much you get.
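
        As a rough sketch of that subtraction: the figures below are assumptions for illustration (about 3.1 billion base pairs at 2 bits each, and a hypothetical 7-billion-parameter model stored as 16-bit floats). The uncompressed 2-bit estimate lands in the same several-hundred-megabyte range as the figure above.

```python
# Back-of-the-envelope sizes; every constant here is a rough assumption.
BASE_PAIRS = 3.1e9        # approximate haploid human genome length
BITS_PER_BASE = 2         # four bases (A, C, G, T) fit in 2 bits

genome_mb = BASE_PAIRS * BITS_PER_BASE / 8 / 1e6

# Hypothetical model: 7 billion parameters, 2 bytes per parameter.
PARAMS = 7e9
BYTES_PER_PARAM = 2
weights_mb = PARAMS * BYTES_PER_PARAM / 1e6

print(f"genome  ~ {genome_mb:.0f} MB")
print(f"weights ~ {weights_mb:.0f} MB")
print(f"weights / genome ~ {weights_mb / genome_mb:.1f}x")
```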

        • myownpetard 2 months ago

          A fairer comparison would be to subtract it from the size of the source code required to represent the LLM.

          • nick3443 2 months ago

            More like the source code AND the complete design for a 200+ degree of freedom robot with batteries etc. pretty amazing.

            It's like a 600mb demoscene demo for Conway's game of life!

            • Terr_ 2 months ago

              That's underselling the product, a swarm of nanobots that are (literally, currently) beyond human understanding that are also the only way to construct certain materials and systems.

              Inheritor of the Gray Goo apocalypse that covered the planet, this kind constructs an enormous mobile mega-fortress with a literal hive-mind, scouring the environment for raw materials and fending off hacking attempts by other nanobots. They even simulate other hive-minds to gain an advantage.

          • marcosdumay 2 months ago

            The source code is the weights. That's what they learn.

            • myownpetard 2 months ago

              I disagree. A neural network is not learning its source code. The source code specifies the model structure and hyperparameters. Then it is compiled and instantiated into some physical medium, usually a bunch of GPUs, and weights are learned.

              Our DNA specifies the model structure and hyperparameters for our brains. Then it is compiled and instantiated into a physical medium, our bodies, and our connectome is trained.

              If you want to make a comparison about the quantity of information contained in different components of an artificial and a biological system, then it only makes sense if you compare apples to apples. DNA:Code :: Connectome:Weights

      • Salgat 2 months ago

        When you say billions of years, you have to remember that change in DNA is glacial compared to computing; we're talking the equivalent of years or even decades for a single training iteration to occur. Deep learning models on the other hand experience millions of these in a matter of a month, and each iteration is exposed to what would take a human thousands of lifetimes to be exposed to.

        • ab5tract 2 months ago

          DNA literally changes inside of a human within a single lifetime.

          It didn't take a thousand years for moths to turn grey during the industrial revolution.

          • Salgat 2 months ago

            Remember we're talking about the human race (and its ancestors) as a whole adopting the mutations that are successful.

      • loa_in_ 2 months ago

        I also think this is a lazy claim. We have so so many internal sources of information like the feeling of temperature or vestibular system reacting to anything from an inclination change to effective power output of heart in real time every second of the day.

        • j_bum 2 months ago

          That’s a fair point. But to push back, how many sources of sensory information are needed for cognition to arise in humans?

          I would be willing to bet that hearing or vision alone would be sufficient to develop cognition. Many of these extra senses are beneficial for survival, but not required for cognition. E.g., we don’t need smell/touch/taste/pain to think.

          Thoughts?

          • krschacht a month ago

            I think we need the other senses for cognition. The other senses are part of the reward function which the cognitive learning algorithms optimize for. Pleasure and pain, and joy and suffering, guide the cognitive development process.

            • j_bum a month ago

              I think you’re starting to conflate emotion with senses.

              Yes pain is a form of sensory experience, but it also has affective/emotional components that can be experienced even without the presence of noxious stimuli.

              However, there are people that don’t experience pain (congenital insensitivity to pain), which is caused by mutations in the NaV1.7 channel, or in one or more of the thermo/chemo/mechanotransducers that encode noxious stimuli into neural activity.

              And obviously, these people who don’t experience the sensory discriminative components of pain are still capable of cognition.

              To steelman your argument, I do agree that lacking all but one of what I would call the sufficient senses for cognition would dramatically slow down the rate of cognitive development. But I don’t think they would prohibit it.

      • SiempreViernes 2 months ago

        This argument mostly just hollows out the meaning of training: evolution gives you things like arms and ears, but if you say evolution is like training you imply that you could have grown a new kind of arm in school.

        • horsawlarway 2 months ago

          Training an LLM feels almost exactly like evolution - the gradient is "ability to procreate" and we're selecting candidates from related, randomized genetic traits and iterating the process over and over and over.

          Schooling/education feels much more like supervised training and reinforcement (and possibly just context).

          I think it's dismissive to assume that evolution hasn't influenced how well you're able to pick up new behavior, because it's highly likely it's not entirely novel in the context of your ancestry, and the traits you have that have been selected for.

      • YeGoblynQueenne 2 months ago

        >> Organisms have had billions of years of training.

        You're referring to evolution but evolution is not optimising an objective function over a large set of data (labelled, too). Evolution proceeds by random mutation. And just because an ancestral form has encountered e.g. ice and knows what that is, doesn't mean that its evolutionary descendants retain the memory of ice and know what that is because of that memory.

        tl;dr evolution and machine learning are radically different processes and it doesn't make a lot of sense to say that organisms have "trained" for millions of years. They haven't! They've evolved for millions of years.

        >> What’s billions of years of sensory information that drove behavior and selection, if not training data?

        That's not how it works: organisms don't train on data. They adapt to environments. Very different things.

    • spencerchubb 2 months ago

      > vertebrates across the board require less data to learn new things

      the human brain is absolutely inundated with data, especially from visual, audio, and kinesthetic mediums. the data is a very different form than what one would use to train a CNN or LLM, but it is undoubtedly data. newborns start out literally being unable to see, and they have to develop those neural pathways by taking in the "pixels" of the world for every millisecond of every day

    • kirkules 2 months ago

      Do you have, offhand, any names or references to point me toward why you think fish and lizards can make rapid common sense deductions about man made objects they couldn't have seen in their evolutionary histories?

      Also, separately, I'm only assuming but it seems the reason you think these deductions are different from hard wired answers is that their evolutionary lineage can't have had to make similar deductions. If that's your reasoning, it makes me wonder if you're using a systematic description of decisions and of the requisite data and reasoning systems to make those decisions, which would be interesting to me.

    • aleph_minus_one 2 months ago

      > I think there is a slight disconnect here between making AI systems which are smart and AI systems which are useful. It’s a very old fallacy in AI: pretending tools which assist human intelligence by solving human problems must themselves be intelligent.

      I have difficulties understanding why you could even believe in such a fallacy: just look around you: most jobs that have to be done require barely any intelligence, and on the other hand, there exist few jobs that do require an insane amount of intelligence.

    • rjsw 2 months ago

      Maybe we just collectively decided that it didn't matter whether the answer was correct or not.

      • aithrowawaycomm 2 months ago

        Again I do think these things have utility and the unreliability of LLMs is a bit incidental here. Symbolic systems in LISP are highly reliable, but they couldn’t possibly be extended to AGI without another component, since there was no way to get the humans out of the loop: someone had to assign the symbols semantic meaning and encode the LISP function accordingly. I think there’s a similar conceptual issue with current ANNs, and LLMs in particular: they rely on far too much formal human knowledge to get off the ground.

        • rjsw 2 months ago

          I meant more why the "boom caught almost everyone by surprise", people working in the field thought that correct answers would be important.

        • nxobject 2 months ago

          Barring a stunning discovery that stops putting the responsibility for NN intelligence on synthetic training sets, it looks like NN and symbolic AI may have to coexist, symbiotically.

  • gregw2 2 months ago

    The article credits two academics (Hinton, Fei Fei Li) and a CEO (Jensen Huang). But really it was three academics.

    Jensen Huang, reasonably, was desperate for any market that could suck up more compute, which he could pivot to from GPUs for gaming when gaming saturated its ability to use compute. Screen resolutions and visible polygons and texture maps only demand so much compute; it's an S-curve like everything else. So from a marketing/market-development and capital investment perspective I do think he deserves credit. Certainly the Intel guys struggled to similarly recognize it (and to execute even on plain GPUs.)

    But... the technical/academic insight of the CUDA/GPU vision in my view came from Ian Buck's "Brook" PhD thesis at Stanford under Pat Hanrahan (Pixar+Tableau co-founder, Turing Award Winner) and Ian promptly took it to Nvidia where it was commercialized under Jensen.

    For a good telling of this under-told story, see one of Hanrahan's lectures at MIT: https://www.youtube.com/watch?v=Dk4fvqaOqv4

    Corrections welcome.

    • markhahn 2 months ago

      Jensen embraced AI as a way to recover TAM after ASICs took over crypto mining. You can see that between-period in NVidia revenue and profit graphs.

      By that time, GP-GPU had been around for a long, long time. CUDA still doesn't have much to do with AI - sure, it supports AI usage, even includes some AI-specific features (low-mixed precision blocked operations).

      • cameldrv 2 months ago

        Jensen embraced AI way before that. CuDNN was released back in 2014. I remember being at ICLR in 2015, and there were three companies with booths: Google and Facebook who were recruiting, and NVIDIA was selling a 4 GPU desktop computer.

        • esjeon 2 months ago

          No. At the time, it was about GPGPU, not AI.

        • dartos 2 months ago

          Well as soon as matmul has a marketable use (ML predictive algorithms) nvidia was on top of it.

          I don’t think they were thinking of LLMs in 2014, tbf.

          • ghc 2 months ago

            I invested in an LLM company in 2014 and nvidia was very aggressive in giving GPUs for use in training models. Their program wasn't targeted specifically at LLMs, but they were definitely aware of the uses.

            In case anyone is confused, here's an explanatory tweet about one of the founders of that company: https://x.com/jxmnop/status/1725949517940294055 . LLMs have been in the works for awhile. But you need lots of money to make them good! Even back in 2014 they were mining reddit for training data.

          • throwaway314155 2 months ago

            Effectively no one was but LLM's are precisely "ML predictive algorithms". That neural networks more broadly would scale at all on gaming chips is plenty foresight to be impressed with.

      • aleph_minus_one 2 months ago

        > Jensen embraced AI as a way to recover TAM after ASICs took over crypto mining.

        TAM: Total Addressable Market

        • samarthr1 2 months ago

          Thanks, that was a TIL moment

      • latchkey 2 months ago

        ASIC's never took over mining ethereum because the algo was memory hard and producing ASIC's wasn't as profitable as just throwing GPUs at the problem...

        https://www.vijaypradeep.com/blog/2017-04-28-ethereums-memor...

        At the peak, there were around 18-25m GPUs deployed worldwide.

        Source: I mined with 150k AMD GPUs.

    • a-dub 2 months ago

      that's what i remember. i remember reading an academic paper about a cool hack where someone was getting the shaders in gpus to do massively parallel general purpose vector ops. it was this massive orders of magnitude scaling that enabled neural networks to jump out of obscurity and into the limelight.

      i remember prior to that, support vectors and rkhs were the hotness for continuous signal style ml tasks. they weren't particularly scalable and transfer learning formulations seemed quite complicated. (they were, however, pretty good for demos and contests)

      • sigmoid10 2 months ago

        You're probably thinking of this paper: https://ui.adsabs.harvard.edu/abs/2004PatRe..37.1311O/abstra...

        They were running a massive neural network (by the standards back then) on a GPU years before CUDA even existed. Even funnier, they demoed it on ATI cards. But it still took until 2012 and AlexNet making heavy use of CUDA's simpler interface before the Deep Learning hype started to take off outside purely academic playgrounds.

        So the insight neither came from Jensen nor the other authors mentioned above, but they were the first ones to capitalise on it.

  • DeathArrow 2 months ago

    I think neural nets are just a subset of machine learning techniques.

    I wonder what would have happened if we poured the same amount of money, talent and hardware into SVMs, random forests, KNN, etc.

    I don't say that transformers, LLMs, deep learning and other great things that happened in the neural network space aren't very valuable, because they are.

    But I think in the future we should also study other options which might be better suited than neural networks for some classes of problems.

    Can a very large and expensive LLM do sentiment analysis or classification? Yes, it can. But so can simple SVMs and KNN and sometimes even better.

    I saw some YouTube coders doing calls to OpenAI's o1 model for some very simple classification tasks. That isn't the best tool for the job.
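
    As a minimal sketch of how little machinery a simple classifier needs, here is a k-nearest-neighbours sentiment classifier over bag-of-words vectors in plain Python. The six training sentences are made up for illustration, and `bag`, `cosine` and `knn_predict` are hypothetical helper names, not any library's API.

```python
from collections import Counter
import math

# Tiny, made-up training set (illustration only, not real data).
TRAIN = [
    ("great movie loved it", "pos"),
    ("wonderful acting great fun", "pos"),
    ("loved the story", "pos"),
    ("terrible movie hated it", "neg"),
    ("boring plot awful acting", "neg"),
    ("hated the ending", "neg"),
]

def bag(text):
    # Bag-of-words: word -> count.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two word-count vectors.
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(text, k=3):
    # Majority vote among the k most similar training sentences.
    query = bag(text)
    ranked = sorted(TRAIN, key=lambda ex: cosine(query, bag(ex[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_predict("loved it great fun"))
print(knn_predict("awful boring movie hated it"))
```

    A real comparison would of course need a proper labelled dataset and a held-out test set; this only shows the shape of the approach.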

    • jasode 2 months ago

      >I wonder what would have happened if we poured the same amount of money, talent and hardware into SVMs, random forests, KNN, etc.

      But that's backwards from how new techniques and progress are made. What actually happens is somebody (maybe a student at a university) has an insight or new idea for an algorithm that's near $0 cost to implement a proof-of-concept. Then everybody else notices the improvement and then extra millions/billions get directed toward it.

      New ideas -- that didn't cost much at the start -- ATTRACT the follow on billions in investments.

      This timeline of tech progress in computer science is the opposite from other disciplines such as materials science or bio-medical fields. Trying to discover the next super-alloy or cancer drug all requires expensive experiments. Manipulating atoms & molecules requires very expensive specialized equipment. In contrast, computer science experiments can be cheap. You just need a clever insight.

      An example of that was the 2012 AlexNet image recognition algorithm that blew all the other approaches out of the water. Alex Krizhevsky had a new insight on a convolutional neural network to run on CUDA. He bought 2 NVIDIA cards (GTX580 3GB GPU) from Amazon. It didn't require NASA levels of investment at the start to implement his idea. Once everybody else noticed his superior results, the billions began pouring in to iterate/refine on CNNs.

      Both the "attention mechanism" and the refinement of "transformer architecture" were also cheap to prove out at a very small scale. In 2014, Jakob Uszkoreit thought about an "attention mechanism" instead of RNN and LSTM for machine translation. It didn't cost billions to come up with that idea. Yes, ChatGPT-the-product cost billions but the "attention mechanism algorithm" did not.
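
      For a sense of how small such an experiment can be, here is a minimal sketch of scaled dot-product attention in plain Python. This is the textbook form associated with later transformer work, which is an assumption about the variant meant here; the function names `softmax` and `attention` and the toy vectors are illustrative.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of the query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy example: the query matches the first key more than the second,
# so the output leans toward the first value vector.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
print(result)
```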

      >into SVMs, random forests, KNN, etc.

      If anyone has found an unknown insight into SVM, KNN, etc that everybody else in the industry has overlooked, they can do cheap experiments to prove it. E.g. The entire Wikipedia text download is currently only ~25GB. Run the new SVM classification idea on that corpus. Very low cost experiments in computer science algorithms can still be done in the proverbial "home garage".

      • FrustratedMonky 2 months ago

        "$0 cost to implement a proof-of-concept"

        This falls apart for breakthroughs that are not zero cost to do a proof-of-concept.

        I think that is what the parent is referring to: that other technologies might have more potential, but would take money to build out.

      • scotty79 2 months ago

        Do transformer architecture and attention mechanisms actually give any benefit to anything else than scalability?

        I thought the main insights were embeddings, positional encoding and shortcuts through layers to improve back propagation.

        • valzam 2 months ago

          When it comes to ML there is no such distinction though. Bigger models == more capable models and for bigger models you need scalability of the algorithm. It's like asking if going to 2nm fabs has any benefit other than putting more transistors in a chip. It's the entire point.

      • DeathArrow 2 months ago

        True, you might not need lots of money to test some ideas. But LLMs and transformers are all the rage so they gather all attention and research funds.

        People don't even think of doing anything else and those that might do, are paid to pursue research on LLMs.

    • edude03 2 months ago

      Transformers were made for machine translation - someone had the insight that when going from one language to another the context mattered such that the tokens that came before would bias which ones came after. It just so happened that transformers were more performant on other tasks, and at the time you could demonstrate the improvement on a small scale.

    • trhway 2 months ago

      >I wonder what would have happened if we poured the same amount of money, talent and hardware into SVMs, random forests, KNN, etc.

      people did that to horses. No car resulted from it, just slightly better horses.

      >I saw some YouTube coders doing calls to OpenAI's o1 model for some very simple classification tasks. That isn't the best tool for the job.

      This "not best tool" is just there for the coders to call while the "simple SVMs and KNN" would require coding and training by those coders for the specific task they have at hand.

      • guappa 2 months ago

        [citation needed]

    • dr_dshiv 2 months ago

      The best tool for the job is, I’d argue, the one that does the job most reliably for the least amount of money. When you consider how little expertise or data you need to use openai offerings, I’d be surprised if sentiment analysis using classical ML methods is actually better (unless you are an expert and have a good dataset).

    • mentalgear 2 months ago

      KANs (Kolmogorov-Arnold Networks) are one example of a promising exploration pathway to real AGI, with the advantage of full explain-ability.

      • astrange 2 months ago

        "Explainable" is a strong word.

        As a simple example, if you ask a question and part of the answer is directly quoted from a book from memory, that text is not computed/reasoned by the AI and so doesn't have an "explanation".

        But I also suspect that any AGI would necessarily produce answers it can't explain. That's called intuition.

        • diffeomorphism 2 months ago

          Why? If I ask you what the height of the Empire State Building is, then a reference is a great, explainable answer.

          • astrange 2 months ago

            It wouldn't be a reference; "explanation" for an LLM means it tells you which of its neurons were used to create the answer, ie what internal computations it did and which parts of the input it read. Their architecture isn't capable of referencing things.

            What you'd get is an explanation saying "it quoted this verbatim", or possibly "the top neuron is used to output the word 'State' after the word 'Empire'".

            You can try out a system here: https://monitor.transluce.org/dashboard/chat

            Of course the AI could incorporate web search, but then what if the explanation is just "it did a web search and that was the first result"? It seems pretty difficult to recursively make every external tool also explainable…

            • diffeomorphism 2 months ago

              Then you should have a stronger notion of "explanation". Why were these specific neurons activated?

              Simplest example: OCR. A network identifying digits can often be explained as recognizing lines, curves, numbers of segments etc.. That is an explanation, not "computer says it looks like an 8"

              • krisoft 2 months ago

                But can humans do that? If you show someone a picture of a cat, can they "explain" why it is a cat and not a dog or a pumpkin?

                And is that explanation how they obtained the "cat-ness" of the picture, or do they just see that it is a cat immediately and obviously, and when you ask them for an explanation they come up with some explaining noises until you are satisfied?

                • diffeomorphism 2 months ago

                  Wild cat, house cat, lynx,...? Sure, they can. They will tell you about proportions, shape of the ears, size as compared to other objects in the picture etc.

                  For cat vs pumpkin they will think you are making fun of them, but it very much is explainable. Though now I am picturing a puzzle about finding orange cats in a picture of a pumpkin field.

                  • krisoft 2 months ago

                    > They will tell you about proportions, shape of the ears, size as compared to other objects in the picture etc.

                    But is that how they know that the image is a cat, or is that some after the fact tacked on explaining?

                    Let me give you an example to better explain what I mean. There are these “botanical identifying” books. You take a specimen unknown to you and the book asks questions like “What shape are the leaves?” “Is the stem woody or not?” “How many petals are on the flower?” It leads you through a process and at the end ideally gives you the specific Latin name of the species. (Or at least narrows it down.)

                    Vs the act of looking at a rose and knowing without having to expend any further energy that it is a rose. And then if someone is questioning you you can spend some energy on counting petals, and describing leaf shapes and find the thorns and point them out and etc.

                    It sounds like most people who want “explainable AI” want the first kind of thing: the blind and amnesiac botanist with the plant identifying book. Versus what humans are actually doing, which is more like a classification model with a tacked-on bullshit generator that reasons about the classification model’s outputs, into which it doesn’t actually have any in-depth insight.

                    And it gets worse the deeper you ask them. How do you know that is an ear? How do you know its shape? How do you know the animal is furry?

                • fragmede 2 months ago

                  Shown a picture of a cloud, why it looks like a cat does sometimes need an explanation until others can see the cat, and it's not just "explaining noises".

            • Retric 2 months ago

              LLM’s are not the only possible option here. When talking about AGI none of what we are doing is currently that promising.

              The search is for something that can write an essay, drive a car, and cook lunch so we need something new.

              • Vampiero 2 months ago

                When people talk about explainability I immediately think of Prolog.

                A Prolog query is explainable precisely because, by construction, it itself is the explanation. And you can go step by step and understand how you got a particular result, inspecting each variable binding and predicate call site in the process.

                Despite all the billions being thrown at modern ML, no one has managed to create a model that does something like what Prolog does with its simple recursive backtracking.

                So the moral of the story is that you can 100% trust the result of a Prolog query, but you can't ever trust the output of an LLM. Given that, which technology would you rather use to build software on which lives depend?

                And which of the two methods is more "artificially intelligent"?

      • yathaid 2 months ago

        Neural networks can encode any computable function.

        KANs have no advantage in terms of computability. Why are they a promising pathway?

        Also, the splines in KANs are no more "explainable" than the matrix weights. Sure, we can assign importance to a node, but so what? It has no more meaning than anything else.

    • empiko 2 months ago

      Deep learning is easy to adapt to various domains, use cases, training criteria. Other approaches do not have the flexibility of combining arbitrary layers and subnetworks and then training them with arbitrary loss functions. The depth in deep learning is also pretty important, as it allows the model to create hierarchical representations of the inputs.

      • f1shy 2 months ago

        But it is very hard to validate for important or critical applications.

    • jensgk 2 months ago

      > I wonder what would have happened if we poured the same amount of money, talent and hardware into SVMs, random forests, KNN, etc.

      From my perspective, that is actually what happened from the mid-90s to 2015. Neural networks were dead in that period, but any other ML method was very, very hot.

    • Meloniko 2 months ago

      And what do you base that on, though?

      I think neural networks are fundamental, and we will focus on and experiment a lot more with architecture, layers and the other parts involved, but emergent features arise through size.

    • netdevnet 2 months ago

      You are supposed to call it AI now. The word "machine learning" is for GOFAI 2nd gen only. Once all investors have been money drained and the next AI winter begins, then you will be allowed to call it Machine Learning

    • f1shy 2 months ago

      > neural nets are just a subset of machine learning techniques.

      Fact by definition

    • ldjkfkdsjnv 2 months ago

      This is such a terrible opinion. I'm so tired of reading the LLM deniers.

  • kleiba 2 months ago

    > “Pre-ImageNet, people did not believe in data,” Li said in a September interview at the Computer History Museum. “Everyone was working on completely different paradigms in AI with a tiny bit of data.”

    That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.

    • evrydayhustling 2 months ago

      Not baloney. The culture around data in 2005-2010 -- at least / especially in academia -- was night and day to where it is today. It's not that people didn't understand that more data enabled richer + more accurate models, but that they accepted data constraints as a part of the problem setup.

      Most methods research went into ways of building beliefs about a domain into models as biases, so that they could be more accurate in practice with less data. (This describes a lot of PGM work). This was partly because there was still a tug of war between CS and traditional statistics communities on ML, and the latter were trained to be obsessive about model specification.

      One result was that the models that were practical for production inference were often trained to the point of diminishing returns on their specific tasks. Engineers deploying ML weren't wishing for more training instances, but better data at inference time. Models that could perform more general tasks -- like differentiating 90k object classes rather than just a few -- were barely even on most people's radar.

      Perhaps folks at Google or FB at the time have a different perspective. One of the reasons I went ABD in my program was that it felt industry had access to richer data streams than academia. Fei-Fei Li's insistence on building an academic computer science career around giant data sets really was ingenious, and even subversive.

      • bsenftner 2 months ago

        The culture was and is skeptical in biased manners. Between '04 and '08 I worked with a group that had trained neural nets for 3D reconstruction of human heads. They were using it for prenatal diagnostics and a facial recognition pre-processor, and I was using it for creating digital doubles in VFX film making. By '08 I'd developed a system suitable for use in mobile advertising, creating ads with people in them, and 3D games with your likeness as the player. VCs thought we were frauds, and their tech advisors told them our tech was an old discredited technique that could not do what we claimed. We spoke to every VC, some of which literally kicked us out. Finally, after years of "no" that same AlexNet success begins to change minds, but now they want the tech to create porn. At that point, after years of "no" I was making children's educational media, there was no way I was gonna do porn. Plus, president of my co was a woman, famous for creating children's media. Yeah, the culture was different then, not too long ago.

        • philipkglass 2 months ago

          Who's offering VC money for neural network porn technology? As far as I can tell, there is huge organic demand for this but prospective users are mostly cheapskates and the area is rife with reputational problems, app store barriers, payment processor barriers, and regulatory barriers. In practice I have only ever seen investors scared off by hints that a technology/platform would be well matched to adult entertainment.

        • evrydayhustling 2 months ago

          Wow, so early for generative -- although I assume you were generating parameters that got mapped to mesh positions, rather than generating pixels?

          I definitely remember that bias about neural nets, to the point of my first grad ML class having us recreate proofs that you should never need more than two hidden layers (one can pick up the thread at [1]). Of all the ideas clunking around in the AI toolbox at the time, I don't really have background on why people felt the need to kill NN with fire.

          [1] https://en.wikipedia.org/wiki/Universal_approximation_theore...

          • bsenftner 2 months ago

            It was annotated face images and 3D scans of heads trained to map one to the other. After a threshold in the size of the training data, good to great results from a single photo could be had to generate the mesh 3D positions, and then again to map the photo onto the mesh surface. Do that with multiple frames, and one is firmly in the Uncanny Valley.

      • tucnak 2 months ago

        > they accepted data constraints as a part of the problem setup.

        I've never heard this be put so succinctly! Thank you

      • rramadass 2 months ago

        Very well said !

    • sgt101 2 months ago

      It's not quite so - we couldn't handle it, and we didn't have it, so it was a bit of a non-question.

      I started with ML in 1994. I was in a small, poor lab, so we didn't have state-of-the-art hardware. On the other hand I think my experience is fairly representative. We worked on SPARC workstations with data sets that were stored in flat files and had thousands or sometimes tens of thousands of instances. We had problems keeping our data sets on the machines and often archived them to tape.

      Data came from very deliberate acquisition processes. For example I remember going to a field exercise with a particular device and directing its use over a period of days in order to collect the data that would be needed for a machine learning project.

      Sometime in the 2000s data started to be generated and collected as "exhaust" from various processes. People and organisations became instrumented in the sense that their daily activities were necessarily captured digitally. For a time this data was latent; people didn't really think about using it in the way that we think about it now. But by about 2010 it was obvious that not only was this data available, but we also had the processing and data systems to use it effectively.

    • kleiba 2 months ago

      Replying to the people arguing against my comment: you guys do not seem to take into account that the technical circumstances were totally different thirty, twenty or even ten years ago! People would have liked to train with more data, and there was a big interest in combining heterogeneous datasets to achieve exactly that. But one major problem was the compute! There weren't any pretrained models that you specialized in one way or the other - you always retrained from scratch. I mean, even today, who's got the capability to train a multibillion-parameter GPT from scratch? And not just retraining a tried and trusted architecture+dataset once, no, I mean as a research project trying to optimize your setup towards a certain goal.

    • littlestymaar 2 months ago

      In 2019, GPT-2 1.5B was trained on ~10B tokens.

      Last week Hugging Face released SmolLM v2 1.7B trained on 11T tokens: 3 orders of magnitude more training data for roughly the same number of parameters, with almost the same architecture.

      So even back in 2019 we can say we were working with a tiny amount of data compared to what is routine now.
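
      As a quick sanity check of the "3 orders of magnitude" figure, here is a minimal sketch that only uses the two token counts quoted above (both are approximate, and the variable names are just for illustration):

```python
import math

# Token counts as quoted in the comment above (approximate).
gpt2_tokens = 10e9       # GPT-2 1.5B, ~10B tokens
smollm2_tokens = 11e12   # SmolLM v2 1.7B, 11T tokens

ratio = smollm2_tokens / gpt2_tokens
orders_of_magnitude = math.log10(ratio)

print(f"ratio: {ratio:.0f}x, about {orders_of_magnitude:.2f} orders of magnitude")
```

      The ratio comes out at roughly 1,100x, so "3 orders of magnitude" is a fair rounding.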

      • kleiba 2 months ago

        True. But my point is that the quote "people didn't believe in data" is not true. Back in 2019, when GPT-2 was trained, the reason they didn't use the 3T of today was not because they "didn't believe in data" - they totally would have had it been technically feasible (as in: they had that much data + the necessary compute).

        The same has always been true. There has never been a stance along the lines of "ah, let's not collect more data - it's not worth it!". It's always been other reasons, typically the lack of resources.

        • littlestymaar 2 months ago

          > they totally would have had it been technically feasible

          TinyLlama[1] was made by an individual on their own last year, training a 1.1B model on 3T tokens with just 16 A100-40G GPUs in 90 days. It was definitely within reach of any funded org in 2019.

          In 2022 (IIRC), Google released the Chinchilla paper about the compute-optimal amount of data to train a given model. For a 1B model, the value was determined to be 20B tokens, which again is roughly 3 orders of magnitude below the current state of the art for the same class of model.

          Until very recently (the first llama paper IIRC, and people noticing that the 7B model showed no sign of saturation during its already very long training) the ML community vastly underestimated the amount of training data that was needed to make an LLM perform at its potential.

          [1]: https://github.com/jzhang38/TinyLlama

    • kccqzy 2 months ago

      Pre-ImageNet was like pre-2010. Doing ML with massive data really wasn't in vogue back then.

      • mistrial9 2 months ago

        except in Ivory Towers of Google + Facebook

        • disgruntledphd2 2 months ago

          Even then maybe Google but probably not Facebook. Ads used ML but there wasn't that much of it in feed. Like, there were a bunch of CV projects that I saw in 2013 that didn't use NNs. Three years later, otoh you couldn't find a devserver without tripping over an NN along the way.

    • cubefox a month ago

      > That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.

      The earliest paper I know of that says this explicitly is "The Unreasonable Effectiveness of Data" from 2009, only three years before AlexNet:

      https://static.googleusercontent.com/media/research.google.c...

      It's about machine translation.

    • FrustratedMonky 2 months ago

      Not really. This is referring back to the 80's. People weren't even doing 'ML'. And back then people were more focused on teasing out 'laws' in as few data points as possible. The focus was more on formulas and symbols, and finding relationships between individual data points. Not the broad patterns we take for granted today.

      • criddell 2 months ago

        I would say using backpropagation to train multi-layer neural networks would qualify as ML, and we were definitely doing that in the 80s.

        • UltraSane 2 months ago

          Just with tiny amounts of data.

          • jensgk 2 months ago

            Compared to today. We thought we used large amounts of data at the time.

            • UltraSane 2 months ago

              "We thought we used large amounts of data at the time."

              Really? Did it take at least an entire rack to store?

              • jensgk 2 months ago

                We didn't measure data size that way. At some point in the future someone will find this dialog and think that we don't have large amounts of data now, because we are not using entire solar systems for storage.

                • UltraSane 2 months ago

                  Why can't you use a rack as a unit of storage at the time? Were 19" server racks not in common use yet? The storage capacity of a rack will grow over time.

                  My storage hierarchy goes: 1) 1 storage drive, 2) 1 server maxed out with the biggest storage drives available, 3) 1 rack filled with servers from 2, 4) 1 data center filled with racks from 3.

                  • fragmede 2 months ago

                    How big is a rack in VW beetles though?

                    It's a terrible measurement because it's an irrelevant detail about how the data is stored. If your data sits in a proprietary cloud, no one knows that detail except the people who work there on that team.

                    So while someone could say they used a 10 TiB data set, or 10T parameters, how many "racks" of AWS S3 that is, is not known outside of Amazon.

                    • UltraSane 2 months ago

                      A 42U 19-inch rack is an industry standard. If you actually work on the physical infrastructure of data centers it is most CERTAINLY NOT an irrelevant detail.

                      And whether your data can fit on a single server, single rack, or many racks will drastically affect how you design the infrastructure.

                      • fragmede 2 months ago

                        A standard so standard you had to give two of the dimensions so as not to confuse it with something else? Like a 48 U tall data center rack, or a 23" wide telco rack?

                        Okay, so it is relatively standard these days, but the problem is that you can change how many "U" or racks you need for the same amount of storage based on how you want to arrange it for a given use case, which will affect access patterns and how it's wired up. A single server could be a compute box hosting no disks (at which point your dataset at rest won't even fit) or a 4U chassis holding 60 SATA drives vertically, at which point you could get 60*32 TiB, about 1.9 pebibytes, for your data in 2024, but it would be a bit slow and have no redundancy. You could fit ten of those in a single rack for about 19 pebibytes with no ToR switch, and just run twenty 1-gig Ethernet cables out (two per server), but what would be the point of that, other than a vendor trying to sell you something?

                        Anyway, say you're told the dataset is 1 petabyte in 2024: is it on a single server or spread across many, possibly duplicated across multiple racks as well? You want to actually read the data at some point, and properly tuning the storage array(s) to keep workers fed and not bottleneck on reading the data off storage may involve some changes to the system layout if you don't have a datacenter fabric with that kind of capacity. Which puts us back at sharding the data in multiple places, at which point even though the data does fit on a single server, it's spread out across a bunch for performance reasons.

                        Trying to derive server layout from dataset size is like asking about the number of lines of code used. A repo with 1 million LoC is different from one with 1,000, sure, but what can you really get from that?
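
                        For anyone who wants to check the capacity arithmetic in the 4U example, a minimal sketch using only the figures from this comment (raw capacity, no redundancy or filesystem overhead; the variable names are illustrative):

```python
# Figures from the comment above: a 4U chassis with 60 drives of 32 TiB each.
drives_per_server = 60
tib_per_drive = 32
servers_per_rack = 10   # ten 4U servers, i.e. 40U of rack space

tib_per_server = drives_per_server * tib_per_drive   # raw TiB in one chassis
pib_per_server = tib_per_server / 1024                # 1 PiB = 1024 TiB
pib_per_rack = pib_per_server * servers_per_rack

# Same rack in decimal petabytes (1 PiB = 2**50 bytes, 1 PB = 10**15 bytes).
pb_per_rack = pib_per_rack * 2**50 / 10**15

print(f"{tib_per_server} TiB = {pib_per_server} PiB per server")
print(f"{pib_per_rack} PiB = {pb_per_rack:.1f} PB per rack")
```

                        Note the binary versus decimal units: the same rack is a little under 19 PiB but a little over 21 PB, which is one more reason "a rack" is a fuzzy unit of dataset size.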

      • mistrial9 2 months ago

        mid-90s had neural nets, even a few popular science kinds of books on it. The common hardware was so much less capable then.

        • robotresearcher 2 months ago

          I worked on robot control with NNs in the early-mid nineties. Maybe seven neurons and 25 edges. No layers at all. The graph and edge weights determined by a genetic algorithm. Fun.

        • sgt101 2 months ago

          mid-60's had neural nets.

          mid-90's had LeCun telling everyone that big neural nets were the future.

          • dekhn 2 months ago

            In the mid-90s I was working on neural nets and other machine learning, based on gradient descent, with manually computed derivatives, on genomic data (from what I can recall, we had no awareness of LeCun; I didn't find out about his great OCR results until much later). It worked fine and it seemed like a promising area.

            My only surprise is how long it took to get to ImageNet, but in retrospect, I appreciate that a number of conditions had to be met (much more data, much better algorithms, much faster computers). I also didn't recognize just how poor MLPs were at sequence modelling, compared to RNNs and transformers.

            • sgt101 2 months ago

              I'm so out of things! What do you mean by manually computed derivatives?

              • dekhn 2 months ago

                I mean we didn't know autodifferentiation was a thing, so we (my advisor, not me) analytically solved our loss function for its partial derivatives. After I wrote up my thesis, I spent a lot of time learning Mathematica and advanced calculus.

                I haven't invested the time to take the loss function from our paper and implement in a modern framework, but IIUC, I wouldn't need to provide the derivatives manually. That would be a satisfying outcome (indicating I had wasted a lot of effort learning math that simply wasn't necessary, because somebody had automated it better than I could do manually, in a way I can understand more easily).
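
                To make the contrast concrete, here is a minimal sketch of forward-mode autodifferentiation with dual numbers in plain Python. The loss is a hypothetical one-weight example (squared error of a single sigmoid unit), not the loss from our paper, and the class and function names are made up for illustration. A modern framework does the same bookkeeping far more generally.

```python
import math

class Dual:
    """A number that carries its value together with its derivative."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def _lift(self, other):
        # Treat plain numbers as constants (derivative 0).
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __sub__(self, other):
        other = self._lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = self._lift(other)
        # Product rule.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x.value))
    # Chain rule: d/dx sigmoid(x) = s * (1 - s).
    return Dual(s, s * (1.0 - s) * x.deriv)

def loss(w, x=2.0, target=1.0):
    # Hypothetical loss: squared error of one sigmoid unit.
    err = sigmoid(w * x) - target
    return err * err

def analytic_grad(w, x=2.0, target=1.0):
    # The same derivative worked out by hand, as we had to do back then.
    s = 1.0 / (1.0 + math.exp(-w * x))
    return 2.0 * (s - target) * s * (1.0 - s) * x

auto = loss(Dual(0.5, 1.0)).deriv   # seed dw/dw = 1
manual = analytic_grad(0.5)
print(auto, manual)
```

                The two numbers agree to floating-point precision, and only `loss` has to change if the model changes; `analytic_grad` is the part that used to be redone by hand.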

                • telotortium 2 months ago

                  I can't express the extent to which autodifferentiation was like a revelation to me. I don't work in ML, but in grad school around 2010 I was implementing density functional theory computations in a code that was written in Fortran 77. My particular optimization needs required computing up to second derivatives. I had Mathematica to actually calculate the derivatives, but even just the step of mechanically translating the computed derivatives into Fortran 77 code would be a week of tedious work. Worse was rewriting these derivative expressions for numerical stability. The worst was realizing you made a mistake in an expression high in the tree and having to rewrite everything below. The whole process took months for a single model, and that's with chain rule depth that probably could be counted on one hand. I can't imagine deep learning making the kind of progress it has without autodifferentiation - the only saving grace is that neural networks tend to be composed of a large number of copies of identical functions, and you only need to go to first derivatives.

              • mistrial9 2 months ago

                it means that code has to read values from each layer and do some summarizing math, instead of passing layer blocks to a graphics card in one primitive operation implemented on the card.

                • dekhn 2 months ago

                  No. I should have said "determined the partial derivatives of the weights with respect to the variables analytically". We didn't have layers; the whole architecture was a truly crazy combination of dynamic programming with multiple different matrices and a loss function that combined many different types of evidence. AFAICT nobody does any of this any more for finding genes. We just take enormous amounts of genetic data and run an autoencoder or a sequence model over it.

  • 2sk21 2 months ago

    I'm surprised that the article doesn't mention that one of the key factors that enabled deep learning was the use of ReLU as the activation function in the early 2010s. ReLU behaves a lot better than the logistic sigmoid that we used until then.

    • sanxiyn 2 months ago

      Geoffrey Hinton (now a Nobel Prize winner!) himself did a summary. I think it is the single best summary on this topic.

        Our labeled datasets were thousands of times too small.
        Our computers were millions of times too slow.
        We initialized the weights in a stupid way.
        We used the wrong type of non-linearity.

      • helltone 2 months ago

        I'm curious and it's not obvious to me: what changed in terms of weight initialisation?

      • imjonse 2 months ago

        That is a pithier formulation of the widely accepted summary of "more data + more compute + algo improvements"

        • sanxiyn 2 months ago

          No, it isn't. It emphasizes the importance of Glorot initialization and ReLU.
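
          A minimal sketch of why those two items matter, in plain Python. It deliberately ignores the weights and only looks at the activation derivatives, so it is a best-case illustration rather than a full analysis; the function names are made up here. The Glorot uniform limit is the usual sqrt(6 / (fan_in + fan_out)).

```python
import math

def sigmoid_grad(z):
    # Derivative of the logistic sigmoid; its maximum is 0.25 at z = 0.
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # Derivative of ReLU: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

def glorot_uniform_limit(fan_in, fan_out):
    # Weights are drawn from U(-limit, limit) to keep activation
    # variance roughly constant from layer to layer.
    return math.sqrt(6.0 / (fan_in + fan_out))

depth = 10
# Product of activation derivatives along a 10-layer path (weights ignored).
best_case_sigmoid = sigmoid_grad(0.0) ** depth   # 0.25 ** 10
best_case_relu = relu_grad(1.0) ** depth         # stays at 1.0 on active units

print(best_case_sigmoid, best_case_relu)
print(glorot_uniform_limit(256, 256))
```

          Even in the sigmoid's best case the gradient factor shrinks by about six orders of magnitude over ten layers, while active ReLU units pass it through unchanged.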

      • HarHarVeryFunny a month ago

        Also:

        nets too small (not enough layers)

        gradients not flowing (residual connections)

        layer outputs not normalized

        training algorithms and procedures not optimal (Adam, warm-up, etc)

    • cma 2 months ago

      As compute has outpaced memory bandwidth most recent stuff has moved away from ReLU. I think Llama 3.x uses SwiGLU. Still probably closer to ReLU than logistic sigmoid, but it's back to being something more smooth than ReLU.

      • 2sk21 2 months ago

        Indeed, there have been so many new activation functions that I have stopped following the literature after I retired. I am glad to see that people are trying out new things.

  • teknover 2 months ago

    “Nvidia invented the GPU in 1999” is wrong on many fronts.

    Arguably the November 1996 launch of 3dfx kickstarted GPU interest and OpenGL.

    After reading that, it’s hard to take the author seriously on the rest of the claims.

    • Someone 2 months ago

      I would not call it “invent”, but it seems Nvidia defined the term GPU. See https://www.britannica.com/technology/graphics-processing-un... and https://en.wikipedia.org/wiki/GeForce_256#Architecture:

      “GeForce 256 was marketed as "the world's first 'GPU', or Graphics Processing Unit", a term Nvidia defined at the time as "a single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second"”

      They may have been the first with a product that fitted that definition to market.

      • kragen 2 months ago

        That sounds like marketing wank, not a description of an invention.

        I don't think you can get a speedup by running neural networks on the GeForce 256, and the features listed there aren't really relevant (or arguably even present) in today's GPUs. As I recall, people were trying to figure out how to use GPUs to get faster processing in their Beowulfs in the late 90s and early 21st century, but it wasn't until about 02005 that anyone could actually get a speedup. The PlayStation 3's "Cell" was a little more flexible.

    • rramadass 2 months ago

      After actually having read the article I can say that your comment is unnecessarily negative and clueless.

      The article is a very good historical one showing how 3 important things came together to make the current progress possible, namely:

      1) Geoffrey Hinton's back-propagation algorithm for deep neural networks

      2) Nvidia's GPU hardware used via CUDA for AI/ML and

      3) Fei-Fei Li's huge ImageNet database to train the algorithm on the hardware. This team actually used "Amazon Mechanical Turk"(AMT) to label the massive dataset of 14 million images.

      Excerpts:

      “Pre-ImageNet, people did not believe in data,” Li said in a September interview at the Computer History Museum. “Everyone was working on completely different paradigms in AI with a tiny bit of data.”

      “That moment was pretty symbolic to the world of AI because three fundamental elements of modern AI converged for the first time,” Li said in a September interview at the Computer History Museum. “The first element was neural networks. The second element was big data, using ImageNet. And the third element was GPU computing.”

    • ahofmann 2 months ago

      Wow, that is harsh. The quoted claim is in the middle of a very long article. The background of the author seems to be more on the scientific side, than the technical side. So throw out everything, because the author got one (not very important) date wrong?

      • RicoElectrico 2 months ago

        Revisionist marketing should not be given a free pass.

        • twelve40 2 months ago

          Yet it's almost the norm these days. Sick of hearing Steve Jobs invented smartphones when I personally was using a device with web and streaming music years before that.

          • kragen 2 months ago

            You don't remember when Bill Gates and AOL invented the internet, Apple invented the GUI, and Tim Berners-Lee invented hypertext?

    • santoshalper 2 months ago

      Possibly technically correct, but utterly irrelevant. The 3dfx chips accelerated parts of the 3d graphics pipeline and were not general-purpose programmable computers the way a modern GPU is (and thus would be useless for deep learning or any other kind of AI).

      If you are going to count 3dfx as a proper GPU and not just a geometry and lighting accelerator, then you might as well go back further and count things like the SGI Reality Engine. Either way, 3dfx wasn't really first to anything meaningful.

      • FeepingCreature 2 months ago

        But the first NVidia GPUs didn't have general-purpose compute either. Google informs me that the first GPU with user-programmable shaders was the GeForce 3 in 2001.

    • binarybits 2 months ago

      Defining who "really" invented something is often tricky. For example I mentioned in the article that there is some dispute about who discovered backpropagation.

      According to Wikipedia, Nvidia released its first product, the RV1, in November 1995, the same month 3dfx released its first Voodoo Graphics 3D chip. Is there reason to think the 3dfx card was more of a "true" GPU than the RV1? If not, I'd say Nvidia has as good a claim to inventing the GPU as 3dfx does.

      • in3d 2 months ago

        NV1, not RV1.

        3dfx Voodoo cards were initially more successful, but I don’t think anything not actually used for deep learning should count.

    • KevinMS 2 months ago

      Can confirm. I was playing Unreal on my dual Voodoo2 SLI rig back in 1998.

    • kragen 2 months ago

      Arguably the November 01981 launch of Silicon Graphics kickstarted GPU interest and OpenGL. You can read Jim Clark's 01982 paper about the Geometry Engine in https://web.archive.org/web/20170513193926/http://excelsior..... His first key point in the paper was that the chip had a "general instruction set", although what he meant by it was quite different from today's GPUs. IRIS GL started morphing into OpenGL in 01992, and certainly when I went to SIGGRAPH 93 it was full of hardware-accelerated 3-D drawn with OpenGL on Silicon Graphics Hardware. But graphics coprocessors date back to the 60s; Evans & Sutherland was founded in 01968.

      I mean, I certainly don't think NVIDIA invented the GPU—that's a clear error in an otherwise pretty decent article—but it was a pretty gradual process.

  • hollerith 2 months ago

    The deep learning boom caught deep-learning researchers by surprise because deep-learning researchers don't understand their craft well enough to predict essential properties of their creations.

    A model is grown, not crafted like a computer program, which makes it hard to predict. (More precisely, a big growth phase follows the crafting phase.)

    • lynndotpy 2 months ago

      I was a deep learning researcher. The problem is that accuracy (+ related metrics) were prioritized in research and funding. Factors like interpretability, extrapolation, efficiency, or consistency were not prioritized, but were clearly important before being implemented.

      Dall-E was the only big surprising consumer model -- 2022 saw a sudden huge leap from "txt2img is kind of funny" to "txt2img is actually interesting". I would have assumed such a thing could only come in 2030 or later. But deep learning is full of counterintuitive results (like the no-free-lunch (NFL) theorem not mattering, or ReLU being better than sigmoid).

      But in hindsight, it was naive to think "this does not work yet" would get in the way of the products being sold and monetized.

    • nxobject 2 months ago

      I'm still very taken aback by how far we've been able to take prompting as somehow our universal language for communicating with our AI of choice.

  • vl 2 months ago

    > So the AI boom of the last 12 years was made possible by three visionaries who pursued unorthodox ideas in the face of widespread criticism.

    I argue that Mikolov with word2vec was instrumental in the current AI revolution. This demonstrated the ease of extracting meaning from text in a mathematical way and directly led to all the advancements we have today with LLMs. And ironically, it didn’t require a GPU.

    • MichaelZuo 2 months ago

      How much easier was it compared to the next best method at the time?

    • barrenko 2 months ago

      Yup, and it all started with a Master's thesis.

  • macrolime 2 months ago

    I took some AI courses around the same time as the author, and I remember the professors were actually big proponents of neural nets, but they believed the missing piece was some new genius learning algorithm rather than just scaling up with more data.

    • rramadass 2 months ago

      > rather than just scaling up with more data.

      That was the key takeaway for me from this article. I didn't know of Fei-Fei Li's ImageNet contribution, which actually gave all the other researchers the essential data to train with. Her intuition that more data would probably make the accuracy of existing algorithms better is, I think, very much underappreciated.

      Key excerpt:

      So when she got to Princeton, Li decided to go much bigger. She became obsessed with an estimate by vision scientist Irving Biederman that the average person recognizes roughly 30,000 different kinds of objects. Li started to wonder if it would be possible to build a truly comprehensive image dataset—one that included every kind of object people commonly encounter in the physical world.

  • hyperific 2 months ago

    The article mentions Support Vector Machines being the hot topic in 2008. Is anyone still using/researching these?

    I often wonder how many useful technologies could exist if trends went a different way. Where would we be if neural nets hadn't caught on and SVMs and expert systems had?

    • bob1029 2 months ago

      I've been looking at SVMs for use with a binary classification experiment. Training and operating these models is quite cheap. The tooling is ubiquitous and will run on a toaster. A binary decision made well can be extremely powerful. Multiple binary decisions underlie... gestures broadly.

      Obvious contextual/attention caveats aside, a few thousand binary classifiers operating bitwise over the training & inference sets would get you enough bandwidth for a half-ass language model. 2^N can be a very big place very quickly.

    • Legend2440 2 months ago

      Expert systems did catch on and do see widespread use - they're just not called AI anymore. It's 'business logic' or 'rules engine' now.

      The issue with SVMs is that they get intractably expensive for large datasets. The cost of training a neural network scales linearly with dataset size, while the kernel trick for SVMs scales with dataset size squared. You could never create an LLM-sized SVM.
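
      A rough sketch of the scaling argument above, in plain Python. It only counts work: a kernel method that materializes the full Gram matrix needs one entry per pair of training points, while one pass of stochastic gradient descent touches each example once. The helper names (`gram_entries`, `sgd_example_visits`) are made up for illustration, and real SVM solvers use caching and decomposition tricks that this ignores.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel value between two points given as tuples of floats."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def gram_matrix(points, gamma=1.0):
    """Full n x n kernel matrix: one kernel evaluation per pair of points."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]

def gram_entries(n_samples):
    """Number of entries in the full kernel matrix (hypothetical helper)."""
    return n_samples * n_samples

def sgd_example_visits(n_samples, epochs=1):
    """Examples touched by SGD over the given epochs (hypothetical helper)."""
    return n_samples * epochs

# Tiny concrete Gram matrix, just to show the n x n shape.
points = [(float(i), float(i % 3)) for i in range(6)]
K = gram_matrix(points)

# Growing the dataset 10x grows the kernel matrix 100x, but one SGD pass only 10x.
for n in (1_000, 10_000, 100_000):
    print(n, gram_entries(n), sgd_example_visits(n))
```

      The gap between the two columns widens with every increase in dataset size, which is the intuition behind "you could never create an LLM-sized SVM".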

    • spencerchubb 2 months ago

      in insurance we use older statistical methods that are easily interpretable, because we are required to have rates approved by departments of insurance

  • PeterStuer 2 months ago

    You have to understand how academia functions. It is not about what works, it is all about what is new or different in theory, regardless of what works in practice.

    So especially when an already explored (and thus claimed) theory relies on scale and emergence, it becomes near impossible for academia to reach, not because of the scale of investment needed (you only have to look at physics to see funding is not a problem), but because the underlying theory is already done.

    NN theory is not very different from what we had in the 80s. The scale is many, many orders of magnitude beyond when I was collecting floppies from all the lab's Macs in the morning to get the results of my overnight run.

  • datavirtue 2 months ago

    It was more like: "Oh crap, here it is." It's not like the community was surprised. Many of us were educated on the principles and understood the possibilities.

    I haven't seen anything this big developed so fast in my life. I have used more Copilot UIs than I can remember. Drastic sweeping changes to how it works and integrates. Just amazing. The pretrained LLM community is hot and a lot of people are tinkering again. It reminds me of the PC explosion of the 1980s-1990s.

  • AvAn12 2 months ago

    Three legs to the stool - the NN algorithms, the data, and the labels. I think the first two are obvious but think about how much human time and care went into labeling millions of images...

    • fragmede 2 months ago

      And the compute power!

  • est 2 months ago

    I think the gpt-3 vs BERT story should be more interesting.

    I was told that OpenAI practically tried to "label every human knowledge" prior to ChatGPT, sort of similar to what Li Feifei was doing.

    At that time, no one seemed to understand how to extract the massive capabilities out of gpt-3 (remember that gpt3 troll bot on 4chan?), until some joking threads on Twitter proposed "Let's think step by step..."

    That seems to be the wakeup prompt to a new AI era.

  • TheRealPomax 2 months ago

    It wasn't "almost everyone", it was straight up everyone.

  • vdvsvwvwvwvwv 2 months ago

    Lesson: ignore detractors. Especially if their argument is "don't be a tall poppy"

    • psd1 2 months ago

      Also: look for fields that have stagnated, where progress is enabled by apparently-unrelated innovations elsewhere

    • xanderlewis 2 months ago

      Unfortunately, they’re usually right. We just don’t hear about all the time wasted.

      • blitzar 2 months ago

        On several occasions I have heard "they said it couldn't be done" - only to discover that yes, it is technically correct; however, "they" was one random person who had no clue, and anyone with any domain knowledge said it was reasonable.

        • friendzis 2 months ago

          Usually when I hear "they said it couldn't be done", it is used as a triumphant downplaying of legitimate critique. If you dig deeper, that "couldn't be done" usually is in relation to some constraints or performance characteristics, which the "done" thing still does not meet, but the goalposts have already been moved.

          • Ukv 2 months ago

            > that "couldn't be done" usually is in relation to some constraints or performance characteristics, which the "done" thing still does not meet

            I'd say theoretical proofs of impossibility tend to make valid logical deductions within the formal model they set up, but the issue is that model often turns out to be a deficient representation of reality.

            For instance, Minsky and Papert's Perceptrons book, credited in part with prompting the 1970s AI winter, gives a valid mathematical proof of the inability of networks within their framework to represent the XOR function. This function is easily solved by multilayer neural networks, but Minsky/Papert considered those to be a "sterile" extension and believed neural networks trained by gradient descent would fail to scale up.
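
            To make the XOR point concrete, here is a small sketch with threshold units. The weights in `two_layer_xor` are hand-picked for illustration rather than learned, and the grid search over single-unit weights is only a sanity check, not a substitute for the actual proof.

```python
from itertools import product

def step(z):
    """Threshold activation: fire (1) only when the input is strictly positive."""
    return 1 if z > 0 else 0

def perceptron(x1, x2, w1, w2, b):
    """A single threshold unit over two inputs."""
    return step(w1 * x1 + w2 * x2 + b)

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def single_layer_solves_xor(grid):
    """Try every (w1, w2, b) on the grid; True if any single unit matches XOR."""
    for w1, w2, b in product(grid, repeat=3):
        if all(perceptron(x1, x2, w1, w2, b) == y for (x1, x2), y in XOR.items()):
            return True
    return False

def two_layer_xor(x1, x2):
    """Two hidden units (OR and AND) feeding one output unit: OR and not AND."""
    h_or = perceptron(x1, x2, 1, 1, -0.5)    # fires if x1 OR x2
    h_and = perceptron(x1, x2, 1, 1, -1.5)   # fires only if x1 AND x2
    return step(h_or - h_and - 0.5)

grid = [i / 2 for i in range(-8, 9)]  # weights and biases from -4.0 to 4.0
print("single unit solves XOR:", single_layer_solves_xor(grid))
print("two-layer outputs:", {x: two_layer_xor(*x) for x in XOR})
```

            No single unit on the grid reproduces XOR, because the four points are not linearly separable, while one extra layer with three units handles it.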

            Or more contemporary, Gary Marcus has been outspoken since 2012 that deep learning is hitting a wall - giving the example that a dense network trained on just `1000 -> 1000`, `0100 -> 0100`, `0010 -> 0010` can't then reliably predict `0001 -> 0001` because the fourth output neuron was never activated in training. Similarly, this function is easily solved by transformers representing input/output as a sequence of tokens thus not needing to light up an untrained neuron to give the answer (nor do humans when writing/speaking the answer).
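
            A toy version of that one-hot example, as a sketch under simplifying assumptions: a single linear layer with zero-initialized weights, trained by plain gradient descent on squared error. This is not Marcus's exact setup. With random initialization the untrained weights would be arbitrary rather than zero, but they still would not be fitted to the unseen pattern.

```python
def train_linear_identity(patterns, n=4, lr=0.5, epochs=200):
    """Fit y = W x to the identity map on the given patterns (zero-init W)."""
    W = [[0.0] * n for _ in range(n)]
    for _ in range(epochs):
        for x in patterns:
            y_pred = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
            err = [y_pred[i] - x[i] for i in range(n)]  # target equals input
            for i in range(n):
                for j in range(n):
                    # Gradient of squared error w.r.t. W[i][j] is err[i] * x[j],
                    # so weights attached to an input that is always 0 never move.
                    W[i][j] -= lr * err[i] * x[j]
    return W

def predict(W, x):
    """Apply the linear layer to one input vector."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

seen = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
unseen = [0, 0, 0, 1]

W = train_linear_identity(seen)
print("seen pattern:  ", predict(W, seen[0]))
print("unseen pattern:", predict(W, unseen))
```

            The three training patterns are reproduced, but the fourth input was never active, so its column of weights never receives a gradient and the network outputs all zeros for `0001`.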

            If I claimed that it was topologically impossible to drink a Capri-Sun, then someone comes along and punctures it with a straw (an unaccounted for advancement from the blindspot of my model), I could maybe cling on and argue that my challenge remains technically true and unsolved because that violates one of the constraints I set out - but at the very least the relevance of my proof to reality has diminished and it may no longer support the viewpoints/conclusions I intended it to ("don't buy Capri-Sun"). Not to say that theoretical results can't still be interesting in their own right - like the halting problem, which does not apply to real computers.

          • marcosdumay 2 months ago

            It's extremely common that legitimate critique gets used to illegitimately attack people doing things differently enough that the relative importance of several factors change.

            This is really, really common. And it's done both by mistake and in bad faith. In fact, it's a guarantee that once anybody tries anything different enough, they'll be constantly attacked this way.

      • vdvsvwvwvwvwv 2 months ago

        What if the time wasted is part of the search? The hive wins but a bee may not. (Capitalism means some bees win too)

        • xanderlewis 2 months ago

          It is. But most people are not interested in simply being ‘part of the search’ — they want a career, and that relies on individual success.

    • jakeNaround 2 months ago

      The lesson is reality is not the dialectics and symbolic logic but all the stuff in it.

      Study story problems and you end up with string theory. Study data computed from endless world of stuff, find utility.

      What a shock building the bridge is more useful than a drawer full of bridge designs.

      • aleph_minus_one 2 months ago

        > What a shock building the bridge is more useful than a drawer full of bridge designs.

        Here, opinions will differ.

  • icf80 2 months ago

    logic is data and data is logic

  • madaxe_again 2 months ago

    I can’t be the only one who has watched this all unfold with a sense of inevitability, surely.

    When the first serious CUDA based ML demos started appearing a decade or so ago, it was, at least to me, pretty clear that this would lead to AGI in 10-15 years - and here we are. It was the same sort of feeling as when I first saw the WWW aged 11, and knew that this was going to eat the world - and here we are.

    The thing that flummoxes me is that now that we are so obviously on this self-reinforcing cycle, how many are still insistent that AI will amount to nothing.

    I am reminded of how the internet was just a fad - although this is going to have an even greater impact on how we live, and our economies.

    • xen0 2 months ago

      What makes you think AGI is either here or imminent?

      For me the current systems still clearly fall short of that goal.

      • madaxe_again 2 months ago

        They do fall short, but progress in this field is not linear. This is the bit that I struggle to comprehend - that which was literally infeasible only a few years ago is now mocked and derided.

        It’s like jet engines and cheap intercontinental travel becoming an inevitability once the rubicon of powered flight is crossed - and everyone bitching about the peanuts while they cruise at inconceivable speed through the atmosphere.

        • diffeomorphism 2 months ago

          Just like supersonic travel between Europe and America becoming commonplace was inevitable. Oh, wait.

          Optimism is good, blind optimism isn't.

          • madaxe_again 2 months ago

            It is yet inevitable - but it wasn’t sustainable in the slightest when it was first attempted - Concorde was akin to the Apollo programme, in being precocious and prohibitively expensive due to the technical limitations of the time. It will, ultimately, be little more remarkable than flying is currently, even as we hop around on suborbital trajectories.

            It isn’t a question of optimism - in fact, I am deeply pessimistic as to what ML will mean for humanity as a whole, at least in the short term - it’s a question of seeing the features of a confluence of technology, will, and knowledge that has in the past spurred technical revolution.

            Newcomen was far from the first to develop a steam engine, but there was suddenly demand for such beasts, as shallow mines became exhausted, and everything else followed from that.

            ML has been around in one form or another for decades now - however we are now at the point where the technology exists, insofar as modern GPUs exist, the will exists, insofar as trillions of dollars of investment flood into the space, and the knowledge exists, insofar as we have finally produced machine learning models which are non-trivial.

            Just as with powered flight, the technology - the internal combustion engine - had to be in place, as did the will (the First World War), and the knowledge, which we had possessed for centuries but had no means or will to act upon. The idea was, in fact, ridiculous. Nobody could see the utility - until someone realised you could use them to drop ordnance on your enemies.

            With supersonic flight - the technology is emerging, the will will be provided by the substantial increase in marginal utility provided by sub-hour transit compared to the relatively small improvement Concorde offered, and the knowledge, again, we already have.

            So no, not optimism - just observance of historical forces. When you put these things together, there tend to be technical revolutions, and resultant societal change.

            • marcosdumay 2 months ago

              > insofar as modern GPUs exist, the will exists, insofar as trillions of dollars of investment flood into the space

              Funny thing. I expect deep learning to have a really bad next decade, people betting on it to see quick bad results, and maybe even for it to disappear from the visible economy for a while, exactly because there have been hundreds of billions of dollars invested.

              Which is no fault of the technology, which I expect to have some usefulness in the long term. I expect a really bad reaction to it coming exclusively from the excessive hype.

            • xen0 2 months ago

              > It is yet inevitable

              The cool thing for you is that this statement is unfalsifiable.

              But what I really took issue with was your timeframe in the original comment. Your statements imply you fully expect AGI to be a thing within a couple of years. I do not.

    • oersted 2 months ago

      What do you think is next?

      • madaxe_again 2 months ago

        An unravelling, as myriad possibilities become actualities. The advances in innumerate fields that ML will unlock will have enormous impacts.

        Again, I cannot understand for the life of me how people cannot see this.

        • alexander2002 2 months ago

          I had a hypothesis once and it is probably 1000% wrong. But I will state it here. /// Once computers can talk to other computers over network in human friendly way <abstraction by llm> and such that these entities completely control our interfaces which we humans can easily do and use them effectively multi-modality then I think there is a slight chance "I" might believe there is AGI or at least some indications of it

          • marcosdumay 2 months ago

            It's unsettling how the Turing Test turned out to be so independent of AGI, isn't it?

            • Terr_ 2 months ago

              Not really, unless someone reading pop-science misunderstood the "Turing Test" as somehow being clear proof of intelligence--whatever that word really means.

              • madaxe_again 2 months ago

                Indeed. Until we know exactly how a human brain works, we should be cautious about describing humans as intelligent. It could just be a simulation of intelligence, for all we know.

                • Terr_ 2 months ago

                  "If some human can be subjectively convinced that something is $X, then it must be $X!"

                  *Bzzt* Circular logic solipsism cleanup on aisle 3.

              • marcosdumay 2 months ago

                I don't think anybody expected it to be a "clear proof". But it was very reasonable, and I never saw anybody disagree that they should be close together.

                How could a computer look intelligent for random people if it was not at least something close to intelligent? Of course, now we know how. But it was really not obvious that those things would be completely different.

                (And yeah, it was obvious that they are not completely the same either. Lots of people convince people that they are more intelligent than they are. For people, that still requires some amount of intelligence.)

        • selimthegrim 2 months ago

          Innumerable?

    • BriggyDwiggs42 2 months ago

      Downvoters are responding to a perceived arrogance. What does agi mean to you?

      • nineteen999 2 months ago

        Could be arrogance, or could be the delusion.

        • BriggyDwiggs42 2 months ago

          Indeed, it sure could be arrogance.

        • madaxe_again 2 months ago

          Why is it a delusion, in your opinion?

          • andai 2 months ago

            It's a delusion on the part of the downvoters.

  • arcmechanica 2 months ago

    It was basically useful to average people and wasn't just some way to steal and resell data or dump ad after ad on us. A lot of dark patterns really ruin services.