Diffusion models are evolutionary algorithms

(gonzoml.substack.com)

113 points | by che_shr_cat 5 days ago

23 comments

  • upghost 17 hours ago

This article uses the term "evolutionary algorithm" far too narrowly, and it is causing a lot of confusion in the comments. I would STRONGLY recommend checking out the book "Evolutionary Optimization Algorithms" by Dan Simon[1]. It is incredibly readable. The type referred to in the article is the classic "Genetic Algorithm" variant of evolutionary algorithms. But genetic programming, evolutionary programming, simulated annealing, ant colony optimization, particle swarm optimization, differential evolution, estimation of distribution algorithms, biogeography-based optimization, cultural algorithms, and opposition-based learning algorithms are just a few examples of other types of evolutionary algorithms.

    In general, they are a great approach to solving any non-convex optimization problem where gradient descent is not a practical choice.

    [1]: https://books.google.com/books/about/Evolutionary_Optimizati...

    • sevensor 2 hours ago

From the evolutionary algorithms perspective, the work presented in the article is tremendously ordinary, even formulaic. There’s always somebody publishing a new evolutionary search operator with a flavor-of-the-month biological analogy. The test functions are bog standard, and although they’re designed to present certain difficulties, they do a pretty bad job of resembling actual black-box engineering models; their main advantages are fast evaluation and analytically known optima. Your best bet for single-objective optimization is probably still CMA-ES.
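
      For anyone curious what "bog standard" looks like concretely, here is a minimal sketch (untested, assuming the third-party cma package) of minimizing Rastrigin, a benchmark whose global optimum is known analytically to be 0 at the origin:

        import numpy as np
        import cma  # third-party package: pip install cma

        def rastrigin(x):
            # Classic multimodal benchmark: global minimum f(0, ..., 0) = 0,
            # with a regular grid of local minima to trap naive searches.
            x = np.asarray(x)
            return 10 * x.size + float(np.sum(x**2 - 10 * np.cos(2 * np.pi * x)))

        # CMA-ES from a deliberately bad starting point, initial step size 0.5.
        es = cma.CMAEvolutionStrategy(8 * [3.0], 0.5)
        es.optimize(rastrigin)
        print(es.result.xbest, es.result.fbest)

      Fast to evaluate and the optimum is known in closed form, which is exactly why it looks nothing like a real black-box engineering model.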

    • ackbar03 8 hours ago

No, the comments are not confused. I wouldn't consider myself an expert, but I've worked a lot with evolutionary algorithms. This is not good work and should be rejected from ICLR, although again, that is my personal opinion. Terminology exists to help distinguish between different concepts, not to be obscured in some handwavy way to seem deep and knowledgeable to sucker some ICLR paper reviewer with no scientific rigor. It is indeed possible to optimize NNs using GAs, but claiming diffusion is an evolutionary algorithm is a huge stretch, and they didn't even draw the connection properly in the paper. There might be something worthwhile in their work, but I would say portraying it this way is just flat-out incorrect.

      • upghost 6 hours ago

I think we are agreeing. I'm not claiming the comments are incorrect in the regard you are mentioning; I mean to say that the paper treats genetic algorithms (GAs) as though they were the only sort of EA.

        Words mean things and my goal is to enhance the discourse by clarifying the imprecise language used in the paper regarding the term "Evolutionary Algorithm".

        I make no comment on any other issues with the paper.

  • bob1029 4 days ago

    I have a hard time with the analogy due to how important population dynamics and solution diversity are to evolutionary algorithms.

    In an EA, each candidate in the population represents a complete potential solution. As the diversity & size of the population increases, the potential for convergence on high quality solutions also increases. I do not see the same concept in diffusion models.

    • upghost 6 hours ago

I would encourage you to check out the "No Free Lunch" paper[1]. tl;dr: the correct approach depends on the problem; no approach is superior to all others. There are provably infinitely many optimization problems where RANDOM GUESSING beats any other algorithm.

(Note: I am not claiming support for the methodology in the paper; I just find the application of search/optimization algos lovely.)

That being said, back before I understood how to program gradient descent, I would often use evolutionary algorithms such as simulated annealing to optimize NNs! (Never claimed it was state of the art; I just wanted to build and optimize some small networks by hand, and my calculus was pretty weak at the time.)
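
      For flavor, that trick looks something like this (a from-memory sketch, not the code I actually used; it fits XOR with a fixed 2-2-1 net):

        import numpy as np

        rng = np.random.default_rng(0)

        # Toy task: fit XOR with a tiny 2-2-1 network. No gradients anywhere.
        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
        y = np.array([0.0, 1.0, 1.0, 0.0])

        def loss(w):
            W1, b1 = w[:4].reshape(2, 2), w[4:6]
            W2, b2 = w[6:8], w[8]
            pred = np.tanh(X @ W1 + b1) @ W2 + b2
            return float(np.mean((pred - y) ** 2))

        w, temp = rng.normal(size=9), 1.0
        for _ in range(20000):
            cand = w + rng.normal(scale=0.1, size=9)  # random perturbation
            d = loss(cand) - loss(w)
            # Always accept improvements; accept regressions with a
            # probability that shrinks as the temperature cools.
            if d < 0 or rng.random() < np.exp(-d / temp):
                w = cand
            temp *= 0.9995
        print(loss(w))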

      [1]: https://en.m.wikipedia.org/wiki/No_free_lunch_in_search_and_...

    • dawnofdusk 20 hours ago

      >I do not see the same concept in diffusion models.

      I believe,

      Population size = noise dimensionality

      Population diversity = noise magnitude

      • amelius 17 hours ago

But IIUC, an EA needs to keep an entire population in memory at once.

        I don't think this is the case for diffusion models.

        • bob1029 17 hours ago

          You can maintain a population of arbitrary size when using approaches such as tournament selection. Each worker can pull a subset of the population from another datacenter and run an isolated tournament. You also don't have to run the entire population each iteration. You can keep win-loss statistics and handle final selection/crossover/mutation in a separate thread once enough have accumulated.

          Put more simply, run it like chess and use things like Elo rating to determine fitness at selection time.
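
          Rough shape of the bookkeeping (a sketch, not my actual code; `beats` stands in for whatever head-to-head evaluation the problem defines):

            import random

            class Candidate:
                def __init__(self, genome):
                    self.genome = genome
                    self.rating = 1000.0  # Elo-style fitness proxy

            def update_elo(winner, loser, k=32):
                expected = 1 / (1 + 10 ** ((loser.rating - winner.rating) / 400))
                winner.rating += k * (1 - expected)
                loser.rating -= k * (1 - expected)

            def isolated_tournament(subset, beats):
                # A worker only ever touches its subset; ratings persist on the
                # candidates, so the full population never has to run at once.
                random.shuffle(subset)
                for a, b in zip(subset[::2], subset[1::2]):
                    w, l = (a, b) if beats(a.genome, b.genome) else (b, a)
                    update_elo(w, l)

            # Selection/crossover/mutation happen later, in a separate thread,
            # once enough ratings have accumulated:
            #   parents = sorted(pop, key=lambda c: c.rating, reverse=True)[:n]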

    • EvoQ 15 hours ago

I think in evolutionary algorithms, updates are by natural selection, while in diffusion models, updates are by denoising. Perturbation in evolutionary algorithms is done with mutations; in a diffusion model it is done with the diffusion process.

      Seems cool!

      • flir 3 hours ago

> in evolutionary algorithms, updates are by natural selection, while in diffusion models, updates are by denoising

But are they not both fitness functions? Couldn't an EA use something akin to denoising as its fitness function and still be an EA?

        (genuine question, complete layman here).

    • nickpsecurity 12 hours ago

I’ll add to your observation about them being complete solutions. Also, since the author thinks they’re like NNs, we’ll do a direct comparison where members are a complete solution. Here’s an evolutionary algorithm (sketched in code after the list):

      1. A series of NN models with random weights.

      2. A way to score how well each NN model works.

      3. Iteratively mixing pieces of models with each other based on how well they scored.
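
      In code, that loop is roughly the following (a sketch; `score` stands in for whatever task-specific evaluation step 2 uses):

        import numpy as np

        rng = np.random.default_rng(0)
        POP, DIM = 32, 50  # 32 toy "models", 50 weights each

        def score(w):
            # Step 2 stand-in: any task-specific evaluation works here.
            return -float(np.sum(w ** 2))

        pop = [rng.normal(size=DIM) for _ in range(POP)]       # step 1
        for gen in range(200):
            pop.sort(key=score, reverse=True)
            parents = pop[: POP // 2]
            children = []
            while len(parents) + len(children) < POP:
                i, j = rng.choice(len(parents), size=2, replace=False)
                mask = rng.random(DIM) < 0.5                    # step 3:
                child = np.where(mask, parents[i], parents[j])  # crossover...
                child += rng.normal(scale=0.05, size=DIM)       # ...then mutation
                children.append(child)
            pop = parents + children
        print(max(map(score, pop)))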

I didn’t see that in articles about deep learning at all. It was like they were totally different things. They were so different that I looked into using both GAs and simulated annealing for optimization of NNs.

I’d say that EAs and NN-like methods are different and complementary.

    • wakawaka28 15 hours ago

I agree with you. I suppose you could simulate having a large population by maintaining a set of snapshots, but that still doesn't seem like the same thing as a classic evolutionary technique. There is no sane way to do things like crossover, I guess, unless you can average the weights from multiple snapshots (a strategy that might not work at all, IMO).

I think I have seen research papers addressing the intersection of neural network optimization and EAs, but direct optimization techniques seem to have been preferred for managing the actual weights. Manipulating the input data is more along the lines of perturbation methods. If you look hard enough you can probably see whatever you want, but it's a stretch.

      To me, neural network training is a distinct task that draws from many areas of optimization and does not fit neatly under any one umbrella.

      • bob1029 2 hours ago

> I suppose you could simulate having a large population by maintaining a set of snapshots, but that still doesn't seem like the same thing as a classic evolutionary technique.

I've got an experiment using this exact technique running on a machine right now, but I wouldn't argue that it is a good alternative to an actual population.

        The reason I attempted this is because snapshotting a candidate is an unbelievably cheap operation. You don't have to allocate memory once the snapshot buffer is in place. It's just a bunch of low-level memory block copy operations. I can achieve several million iterations per candidate per hour with this hack, but only at the expense of maintaining a large population.

My conclusions so far are that the snapshot/restore approach is interesting, but it seems to only take you into local minima faster than everything else. Real population diversity is an information theory thing that will implicate memory bandwidth and the like, and slow things down by 3-4 orders of magnitude from where we'd prefer to be.
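
        For the curious, the cheap part in numpy terms (the shape of it, not my actual code):

          import numpy as np

          weights = np.random.default_rng(0).normal(size=1_000_000)
          snapshot = np.empty_like(weights)  # allocated once, up front

          np.copyto(snapshot, weights)  # "save": one raw block copy
          # ... mutate and evaluate `weights` in place ...
          np.copyto(weights, snapshot)  # "restore" if the step regressed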

      • jcgrillo 13 hours ago

        Admittedly I haven't dug too deep into the references, including the paper the article refers to--and I'm not too well versed in diffusion models either. Those caveats aside, I'm also having trouble making the analogy, and similarly for me the hangup is about crossover. If I recall correctly, what distinguishes evolution from, say, simulated annealing at varying temperatures, is precisely the crossover operator. It's not obvious where crossover is happening here.

        • robotresearcher 9 hours ago

          Crossover is not a defining or essential part of an EA. Reproduction with variation under selection pressure is the definition. Crossover is one possible mechanism of reproduction with variation. Simple mutation is another.
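
          The minimal case makes that concrete: a (1+1) evolution strategy is a textbook EA with no crossover anywhere (sketch):

            import random

            def fitness(x):
                return -(x - 3.14) ** 2  # toy objective, maximized at x = 3.14

            parent = random.gauss(0.0, 1.0)
            for _ in range(10000):
                child = parent + random.gauss(0.0, 0.1)  # reproduction with variation
                if fitness(child) >= fitness(parent):    # selection pressure
                    parent = child
            print(parent)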

          • wakawaka28 an hour ago

I would not call back propagation and similar techniques "mutation". It certainly isn't random in the same sense as other random searches like EAs. Maybe the exact mechanism of iteration is not that important as long as some aspect of it is random, but it's still a novel interpretation of EA.

          • jcgrillo 4 hours ago

            Fair enough, I'm convinced now.

  • throwaway314155 19 hours ago

    Michael Levin's work is fascinating. Seems there's no field he can't help contribute to from a biological perspective.

  • adamnemecek 4 days ago

    They are all bialgebras.

    • will_byrd 4 days ago

      Interesting comment. Would you mind expanding on that observation? Are there any references you'd suggest looking at that help make the connection more clear? Thank you!

    • tbalsam 4 days ago

I saw this comment, thought, "yeah, this reminds me of Adam Nemecek", and lo and behold.

  • SubiculumCode 20 hours ago

    In the end. Linear Regression.