Understanding Gaussians

(gestalt.ink)

152 points | by lapnect 8 months ago

35 comments

mjhay 8 months ago

Great article, but I wish it had made a more explicit mention of the* central limit theorem (CLT), which I think is what makes the normal distribution "normal." For those not familiar, here is the gist: suppose you have `n` independent, finite-variance random variables with support in the real numbers (so things like count R.V.s work). Asymptotically, as n->infinity, the distribution of the mean will approach a normal distribution. Usually, n doesn't have to be big for this to be a reasonable approximation. n~30 is often fine. The CLT extends in a

To me, this is one of the most astonishing things about probability theory, as well as one of the most useful.
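A minimal simulation sketch of the claim above, assuming exponential summands (a skewed distribution with mean 1 and variance 1, chosen here purely for illustration). It standardizes the mean of n = 30 draws and reports how often it falls within one and two standard deviations; for an exact normal those fractions are about 0.683 and 0.954. The function names are hypothetical.

```python
import math
import random
import statistics


def standardized_means(n, trials, seed=0):
    """Standardized sample means of n Exponential(1) draws (mean 1, variance 1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(trials):
        mean = statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        # Subtract the true mean and rescale by sqrt(n) / sigma, with sigma = 1.
        out.append((mean - 1.0) * math.sqrt(n))
    return out


def fraction_within(values, k):
    """Fraction of values lying within k standard deviations of zero."""
    return sum(abs(v) <= k for v in values) / len(values)


z = standardized_means(n=30, trials=20_000)
within_one = fraction_within(z, 1.0)
within_two = fraction_within(z, 2.0)
print(within_one, within_two)  # compare with roughly 0.683 and 0.954 for a normal
```

Swapping `expovariate` for another finite-variance distribution (with its own mean and standard deviation in the standardization step) should give similar fractions.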

The normal distribution is just one of a class of "stable distributions," all sharing the properties of sums of their R.V.s being in the same family.

The same idea can be generalized much further. The underlying idea is the distribution of "things" as they get asymptotically "bigger." The density of eigenvalues of random matrices with I.I.D entries approach the Wigner Semicircle Distribution, which is exactly what it sounds like. It plays the role of the normal distribution in the very practically-promising theory of free (noncommutative) probability.

https://en.wikipedia.org/wiki/Wigner_semicircle_distribution
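A rough numerical sketch of the semicircle claim, assuming a real symmetric matrix whose entries on and above the diagonal are independent standard normals, scaled by 1/sqrt(n). Rather than computing eigenvalues, it uses the fact that the k-th moment of the eigenvalue distribution equals tr(A^k)/n; the semicircle law on [-2, 2] has second moment 1 and fourth moment 2. The function name is hypothetical.

```python
import math
import random


def eigenvalue_moments(n, seed=0):
    """Second and fourth moments of the eigenvalues of a scaled symmetric Gaussian matrix."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            x = rng.gauss(0.0, 1.0) * scale
            a[i][j] = x
            a[j][i] = x  # enforce symmetry
    # tr(A^2) is the sum of squared entries of a symmetric matrix.
    m2 = sum(x * x for row in a for x in row) / n
    # For symmetric A, (A^2)[i][j] is the dot product of rows i and j,
    # and tr(A^4) is the sum of squared entries of A^2.
    m4 = 0.0
    for i in range(n):
        for j in range(n):
            d = sum(p * q for p, q in zip(a[i], a[j]))
            m4 += d * d
    m4 /= n
    return m2, m4


m2, m4 = eigenvalue_moments(100)
print(m2, m4)  # the semicircle law predicts values near 1 and 2 for large n
```

The agreement is only approximate at finite n; the fourth moment in particular carries a correction of order 1/n.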

Further reading:

https://terrytao.wordpress.com/2010/01/05/254a-notes-2-the-c...

*there's a few normal distribution CLTs, but this is the intuitive one that usually matters in practice

mturmon 8 months ago

> ...most astonishing things about probability theory...

It's a core result, perhaps the most useful core result of standard probability theory.

But from some points of view, the CLT is not actually astonishing.

If you know what Terry Tao (in the convenient link above) calls the "Fourier-analytic proof", the CLT for IID variables can seem inevitable, as long as the underlying distribution is such that the moment generating function (density Fourier transform) of the first summand exists.

I'd be interested to hear if you have sympathy with the following reasoning:

The Gaussian distribution corresponds to a MGF with second-order behavior like 1 - t^2/2 around the origin. You only care about MGF behavior around the origin because, as N -> \infty, that's all that matters.

Because of the way we normalized the sum (we subtracted the mean), the first-order term in the MGF will vanish. We purposely zeroed it out by centering the sum around zero. That leaves the second-order term, which will give a Gaussian distribution.

So in short:

    - MGF of one summand exists => MGF of (recentered) sum exists
    - We have an expression for the MGF of the recentered sum (convolution property)
    - Only the MGF behavior around the origin matters
    - We re-center the sum, causing the first-order term to vanish
    - We invert the resulting MGF and recover the Gaussian

I'm not being precise here, but I hope the idea comes through.
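The steps above can be written out as a short worked derivation. This is a sketch under simplifying assumptions that are not in the comment: the summands X_1, ..., X_n are i.i.d. with mean 0 (already recentered) and variance 1, their MGF M(t) exists near the origin, and S_n = (X_1 + ... + X_n)/sqrt(n).

```latex
\begin{aligned}
M(t) &= \mathbb{E}\, e^{tX} = 1 + \tfrac{t^2}{2} + o(t^2) \quad (t \to 0), \\
M_{S_n}(t) &= \mathbb{E}\, e^{t S_n}
            = \left[ M\!\left( \tfrac{t}{\sqrt{n}} \right) \right]^n
            = \left[ 1 + \tfrac{t^2}{2n} + o\!\left( \tfrac{1}{n} \right) \right]^n \\
           &\xrightarrow[n \to \infty]{} e^{t^2/2},
\end{aligned}
```

which is the MGF of a standard Gaussian. With the characteristic function E e^{itX} (the Fourier-transform convention) the same steps give 1 - t^2/2 near the origin and the limit e^{-t^2/2}, which matches the sign used in the comment.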

mjhay 8 months ago

Hi, I didn't see this reply before, but I think that's a wonderfully simple way of looking at it. Thanks for the intuition and step-by-step construction.

That makes me think of the normal distribution and the heat kernel, which I'd be very interested to hear your thoughts on. The heat kernel is the Green's function solution of the heat equation (governing heat or other diffusive transport in the absence of material movement transporting the quantity along with it):

dT/dt = d^2T/dx^2 (pretend those are partial derivatives)

If we have an initial condition of T=1 at the origin and zero everywhere else (e.g., a spiked Dirac delta), the solution at time t>0 is the same as a normal distribution after appropriate normalization. The variance simply gets spatially bigger as time goes on and the thermal energy continues to diffuse.
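A small numerical check of this, assuming the unit-diffusivity form of the heat equation (time derivative equals second spatial derivative) and a unit amount of heat released at the origin. Under those assumptions the heat kernel is exp(-x^2 / (4t)) / sqrt(4 pi t), which is a normal density with variance 2t. The helper names are hypothetical, and the derivatives are approximated by central finite differences.

```python
import math
from statistics import NormalDist


def heat_kernel(x, t):
    """Solution at time t > 0 for a unit point source at the origin (diffusivity 1)."""
    return math.exp(-x * x / (4.0 * t)) / math.sqrt(4.0 * math.pi * t)


def residual(x, t, h=1e-3):
    """Finite-difference estimate of dT/dt - d^2T/dx^2; should be close to zero."""
    dT_dt = (heat_kernel(x, t + h) - heat_kernel(x, t - h)) / (2.0 * h)
    d2T_dx2 = (heat_kernel(x + h, t) - 2.0 * heat_kernel(x, t) + heat_kernel(x - h, t)) / (h * h)
    return dT_dt - d2T_dx2


t = 1.0
for x in (0.0, 0.5, 1.5):
    # Same value as a normal density with standard deviation sqrt(2 t).
    normal_pdf = NormalDist(0.0, math.sqrt(2.0 * t)).pdf(x)
    print(x, heat_kernel(x, t), normal_pdf, residual(x, t))
```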

Thinking about your intuition, the first derivative at the origin is zero as well (because the heat should diffuse the same in every direction absent any conductivity anisotropy). The second derivative is also near zero around the origin, for sufficiently small distances and sufficiently large t>0.

Because the heat kernel is flat to second (not first) order near the origin, heat flux vanishes to second order. At sufficiently large distances/small times, the flux vanishes to any order. The sweet spot is in the middle, at the steep part of the heat kernel/Gaussian. There, the heat entering and the heat exiting a point differ only infinitesimally, to first order in the temperature gradient. But that doesn't mean that heat transport in and out of a point is proportional to the temperature gradient! One point can only transport heat to infinitesimally nearby points. The difference in temperature between infinitesimally close points, where the temperature gradient is infinitesimal, is doubly infinitesimal.

abetusk 8 months ago

Good for you for stating the assumptions properly that go into the CLT and for mentioning other stable distributions.

I disagree about the Gaussian being the "normal" case or the "one that usually matters". Finite variance is a big assumption and one that's routinely violated in practice.

For those that are interested, Levy-stable distributions are the general class of convergent sums of random variables [0], synonymously called "fat-tailed" or "heavy-tailed" distributions and include Pareto [1] and the Cauchy distributions [2].
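A minimal sketch of what failing the finite-variance assumption looks like, using the standard Cauchy distribution as the example. The mean of n Cauchy draws is again standard Cauchy, so averaging does not narrow the spread, whereas the mean of n Gaussian draws tightens like 1/sqrt(n). The helper names are hypothetical; the Cauchy draws use the inverse-CDF formula tan(pi (U - 1/2)).

```python
import math
import random
import statistics


def cauchy(rng):
    """One standard Cauchy draw via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def gaussian(rng):
    """One standard normal draw."""
    return rng.gauss(0.0, 1.0)


def iqr_of_means(draw, n, trials, seed=0):
    """Interquartile range of the sample mean of n draws, estimated over many trials."""
    rng = random.Random(seed)
    means = [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1


cauchy_iqr = iqr_of_means(cauchy, n=200, trials=2000)
gaussian_iqr = iqr_of_means(gaussian, n=200, trials=2000)
# A single standard Cauchy draw has IQR 2; a single standard normal about 1.35.
print(cauchy_iqr, gaussian_iqr)
```

The interquartile range is used instead of the variance because the sample variance of Cauchy data does not settle down either.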

Is there an intuitive explanation for why the Wigner semicircular law is basically the "logarithm" of the Gaussian in some respect?

[0] https://en.wikipedia.org/wiki/L%C3%A9vy_distribution

[1] https://en.wikipedia.org/wiki/Pareto_distribution

[2] https://en.wikipedia.org/wiki/Cauchy_distribution

CrazyStat 8 months ago

“Normal” in the context of the normal distribution actually derives from the technical meaning of normal as perpendicular, like the normal map in computer graphics. The linguistic overloading with normal in the sense of usual or ordinary is an unfortunate coincidence.

abetusk 8 months ago

It looks like that story is apocryphal.

There's a reddit question which refutes this idea [0] and provides some sources (which are paywalled) [1] [2].

That reddit question also has a source [3] that claims Galton used the term "normal" in the "standard model, pattern type" sense from the 1880s onwards:

""" ... However in the 1880s he began using the term "normal" systematically: chapter 5 of his Natural Inheritance (1889) is entitled "Normal Variability" and Galton refers to the "normal curve of distributions" or simply the "normal curve." Galton does not explain why he uses the term "normal" but the sense of conforming to a norm ( = "A standard, model, pattern, type." (OED)) seems implied. """

Though I haven't confirmed, it looks like Gauss never used the term "normal" to denote orthogonality of the curve.

Do you have a source?

[0] reddit.com/r/statistics/comments/rvuj4r/q_why_did_karl_pearson_call_the_gaussian

[1] https://www.google.co.uk/books/edition/Statistics_and_Public...

[2] https://www.jstor.org/stable/2684625

[3] https://condor.depaul.edu/ntiourir/NormalOrigin.htm

wodenokoto 8 months ago

> You can see that the data is clustered around the mean value. Another way of saying this is that the distribution has a definite scale. [..] it might theoretically be possible to be 2 meters taller than the mean, but that’s it. People will never be 3 or 4 meters taller than the mean, no matter how many people you see.

The way the author defines definite scale is that there is a max and a minimum, but that is not true for a gaussian distribution. It is also not true that if we keep sampling wealth (an example of a distribution without definite scale used in the article), there is no limit to the maximum.

klysm 8 months ago

I think he’s saying that the distribution of human heights has definite scale, not the Gaussian?

dekhn 8 months ago

Human height (by gender) very nearly follows a Gaussian distribution, because height is determined (hand-waving away complexity) by a sum of many independent random variables. In reality it's not truly Gaussian for a number of reasons.

wodenokoto 8 months ago

No, author very much says the Gaussian has definite scale:

> There are a few distributions like this with a definite scale, but the Gaussian is the most famous one.

deepnet 8 months ago

Zeng Jinlian (1964–1982) of China was 8 feet, 1 inch (2.46 meters) when she died, making her the tallest woman ever. According to Guinness World Records, Zeng is the only woman to have passed 8 feet (about 2.44 meters).

Mean from article 163.

So the facts check out.

Author is correct.

Also very interesting the suggestion that human height is not Gaussian.

Snip :

“Why female soldiers only? If we were to mix male and female soldiers, we would get a distribution with two peaks, which would not be Gaussian.”

Which begs the question what other human statistics are non Gaussian if sexes are mixed and does this apply to other strong differentiators like historical time, nutrition, neural tribes ?

Statistics is highly non-trivial.

nwnwhwje 8 months ago

Nothing is Gaussian then. What probability distribution allows for Graham's Number to be a possibility?

klysm 8 months ago

The Gaussian has non-zero mass everywhere.
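To put numbers on this, here is a small sketch that evaluates the upper-tail probability of a standard normal at k standard deviations using the complementary error function. The function name is hypothetical. The values stay strictly positive while shrinking extremely fast; far enough out they underflow to zero in double precision, which is a limit of floating point rather than of the distribution.

```python
import math


def upper_tail(k):
    """P(Z > k) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


for k in (1, 5, 10, 30):
    print(k, upper_tail(k))
```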

shiandow 8 months ago

It's an oversimplification but at some point there is really no difference between impossible and 'incredibly small probability'.

I mean sure it is possible for all air molecules to randomly all go to the same corner of the room at the same time (heck it is inevitable in some sense), you can play it back in reverse to check no laws of physics were broken, but practically that simply does not happen.

KK7NIL 8 months ago

> at some point there is really no difference between impossible and 'incredibly small probability'.

This is not true.

Using your air molecules example: every microstate (i.e. location and speed of all the molecules) possible under the given macrostate (temperature, number of molecules, etc.) has probability 0 of happening but isn't impossible, simply because the microstates are described by real variables and the real numbers are uncountable. Impossible microstates also have probability 0 but are obviously not the same.

shiandow 8 months ago

A bit late, but if you do consider events with vanishingly small probability to be impossible then they do become equal.

It's just that you then have to contend with the paradox that impossible events happen all the time, just not the ones you had in mind.

tylerneylon 8 months ago

I like the font, images, and layout of this article. Does anyone happen to know if a tool (that I can also use) helped achieve this look?

Or if not, does anyone know how to reach the author? I may have missed it, but I didn't even see the author's name anywhere on the site.

generuso 8 months ago

The author is Peter Bloem, and the html is compiled from these sources: https://github.com/pbloem/gestalt.ink

with the help of mathjax: https://www.mathjax.org/

The font seems to be Georgia.

creata 8 months ago

The CSS says:

    font-family: charter, Georgia, serif;

You can get a convenient copy of Charter here: https://practicaltypography.com/charter.html

Another free font based on (and largely identical to?) Charter is Charis: https://software.sil.org/charis/

generuso 8 months ago

Indeed. My mistake.

tylerneylon 8 months ago

Thank you!

esafak 8 months ago

The maths typeface is Neo-Euler: https://fontlibrary.org/en/font/euler-otf

hughw 8 months ago

Gaussian, Gaussian, Gaussian. Important to understand Gaussians, but also to recognize how profoundly non-Gaussian, in particular multimodal, the world is. And to build systems that navigate and optimize over such distributions.

(Not complaining about this article, which is illuminating).

photochemsyn 8 months ago

A particularly interesting case is Maxwell-Boltzmann distributions of the speeds of molecules in a gas in a 3D space. Even though the individual velocities of gas molecules along the x, y and z directions do follow Gaussian distributions, the distributions of scalar speeds do not (since the speed is obtained from the velocities by a non-linear transformation), resulting in a long tail of high velocities, and a median value less than the mean value.
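A quick simulation sketch of this, assuming unit-variance velocity components (i.e. units in which kT/m = 1). Each velocity component is drawn from a standard normal and the speed is the Euclidean norm of the three components. In these units the mean speed is 2 sqrt(2/pi), about 1.596, and the simulated median should come out below the simulated mean. The function name is hypothetical.

```python
import math
import random
import statistics


def sample_speeds(count, seed=0):
    """Speeds of particles whose x, y, z velocity components are standard normal."""
    rng = random.Random(seed)
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(3)))
        for _ in range(count)
    ]


speeds = sample_speeds(50_000)
mean_speed = statistics.fmean(speeds)
median_speed = statistics.median(speeds)
print(mean_speed, median_speed)  # the right skew shows up as median < mean
```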

Incidentally human expertise and ability seems to follow the Maxwell-Boltzmann model far more than the Gaussian 'bell curve' model - there's a long tail of exceptional capabilities.

slashdave 8 months ago

There was an opportunity when heights of soldiers were discussed. Gaussians have infinite extent, but soldier heights must be positive.

hughw 8 months ago

Good example

lamename 8 months ago

> The best way to do that, I think, is to do away entirely with the symbolic and mathematical foundations, and to derive what Gaussians are, and all their fundamental properties from purely geometric and visual principles. That’s what we’ll do in this article.

Perhaps I have a different understanding of "symbolic". The article proceeds to use various symbolic expressions and equations. Why say this above if you're not going to follow through? Visuals are there but peppered in.

Torkel 8 months ago

Agree. This text relies heavily on traditional mathematics to define and work through things. It's quite good at that! But it does become weird when it starts out by declaring that it won't do what it then does.

It also felt like this could be a good topic for a 3b1b video... and... here's the 3b1b video on gaussians: https://www.youtube.com/watch?v=d_qvLDhkg00

mhh__ 8 months ago

https://gregorygundersen.com/blog/2020/04/11/moments/

youoy 8 months ago

Thanks for sharing! The Gaussian distribution never gets old. And nice plot of this:

> 100 000 points drawn from a Gaussian distribution and passed through a randomly initialized neural network.

It gives you a sense of how complex the folding of the space by NNs can be. And also the complexity of the patterns that they can pick up.

dian_hacks 8 months ago

> If we want to stretch a function f(x) vertically by a factor of y, we should multiply its input by 1/y: f(1/y x)

I didn't quite follow this part.

FabHK 8 months ago

Possibly the author meant "horizontally".
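A tiny sketch of the distinction, assuming that reading is right. Dividing the input by y stretches the graph horizontally by a factor of y, while multiplying the output by y stretches it vertically. The helper names and the example function are hypothetical.

```python
import math


def bump(x):
    """An unnormalized Gaussian-shaped bump, used only as an example function."""
    return math.exp(-x * x)


def stretch_horizontally(f, y):
    """Returns x -> f(x / y): the same shape, y times wider."""
    return lambda x: f(x / y)


def stretch_vertically(f, y):
    """Returns x -> y * f(x): the same shape, y times taller."""
    return lambda x: y * f(x)


wide = stretch_horizontally(bump, 3.0)
tall = stretch_vertically(bump, 3.0)

# The value bump takes at x = 1 is taken by the wide version at x = 3.
print(bump(1.0), wide(3.0))
# The tall version is three times higher at the same x.
print(bump(1.0), tall(1.0))
```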

brcmthrowaway 8 months ago

Now explain Gaussian splatting

CamperBob2 8 months ago

He never gets as far as splatting, but if you follow the links on the page you eventually end up at a really nice set of lecture notes on Gaussian diffusion: https://dlvu.github.io/pdfs/lecture11.diffusion.annotated.pd...