Probability-Generating Functions

(entropicthoughts.com)

143 points | by todsacerdoti 10 hours ago

36 comments

  • KvanteKat 6 hours ago

    For those interested in looking slightly more into the characteristic function, it may be worth pointing out that the characteristic function is equal to the Fourier transform (with the sign of the argument reversed) of the probability distribution in question.

    In my own experience teaching probability theory to physicists and engineers, establishing this connection is often a good way of helping people build intuition for why characteristic functions are so useful, why they crop up everywhere in probability theory, and why we can extract so much useful information about a distribution by looking at its characteristic function (since this group of students tends to already be rather familiar with Fourier transforms).

    • jamessb 5 hours ago

      Yes, this provides good intuition about why it is useful: the PDF of the sum of two random variables is the convolution of the original PDFs. A convolution is awkward to work with, but by the convolution theorem it is a multiplication in the Fourier domain. This immediately suggests that the Fourier transform of a PDF would be a useful thing to work with.

      If you don't say that this is what you are doing then it all seems quite mysterious.
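
      A minimal numeric sketch of the convolution-theorem point, assuming NumPy (the normal densities and grid are illustrative choices, not from the article): multiplying the FFTs of two densities and inverting approximates the density of the sum, which for normals is known in closed form.

        import numpy as np

        dx = 0.01
        x = np.arange(-10, 10, dx)  # grid wide enough that both densities vanish at the edges

        f = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)        # N(0,1) density
        g = np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)  # N(1,1) density

        # Convolution theorem: convolution becomes a pointwise product
        # in the Fourier domain.
        conv = np.fft.fftshift(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real) * dx

        # Known closed form for the sum: N(0,1) + N(1,1) = N(1,2).
        h = np.exp(-(x - 1)**2 / 4) / np.sqrt(4 * np.pi)

        print(np.max(np.abs(conv - h)))  # tiny: discretization error only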

      • creata an hour ago

        > the PDF of the sum of two random variables is the convolution of the original PDFs

        (Probably obvious to everyone reading, but the variables should be independent.)

    • fermisea 4 hours ago

      As a physicist, the moment when everything just clicked was when I realised that connected Feynman diagrams were basically the cumulants of that distribution. Then almost everything in physics is about "what is the characteristic/moment/cumulant generating function?" and associated Legendre transforms

    • nycticorax 2 hours ago

      I feel like it's almost criminal of textbook writers not to mention this when introducing the characteristic function... At least as an aside or a footnote, for readers already familiar with Fourier transforms.

    • bc569a80a344f9c 6 hours ago

      I had not made that connection and find that incredibly useful. Thank you for pointing that out.

    • ysofunny 5 hours ago

      but isn't a characteristic function just "the" way to bridge the gap between sets, functions, and logic(? ...a 3way bridge!?)

      I mean, it was useful for me to think of it as a translation from sets and logic (this variable x is in the set, xor not) into functions (a function f(x) that returns 1, or true, whenever x is in set S)

      how the heck is that a Fourier transform!??

      • jamessb 5 hours ago

        You're thinking of a "characteristic function" in the sense of the "indicator function" of a subset (https://en.wikipedia.org/wiki/Indicator_function), which is a different thing to the characteristic function of a probability density function.

      • KvanteKat 4 hours ago

        You can think of it like this:

        - The characteristic function of a random variable X is defined as the function that maps t --> ExpectedValue[ exp( i * t * X ) ]

        - Computing this expected value is the same as regarding t as a constant and integrating the function x --> exp( i * t * x) with respect to the distribution of X, i.e. if X has the density f, we compute the integral of f(x) * exp( i * t * x) with respect to x over the domain of f.

        - On the other hand: computing the Fourier transform of f (here representing the density of X) and evaluating it at the point t (i.e. computing (F(f))(t) if F represents the Fourier transform) is the same as fixing t and computing the integral of f(x) * exp( -i * t * x) with respect to x.

        - Rearranging the integrand in the previous expression to f(x) * exp( i * -t * x), we see that it is the same as the integrand used in the characteristic function, only with a -t instead of a t.

        Hope that helps :)
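
        A small numeric check of the bullets above, assuming NumPy and SciPy are available: for a standard normal, integrating f(x) * exp(i * t * x) reproduces the known characteristic function exp(-t²/2).

          import numpy as np
          from scipy.integrate import quad

          def normal_pdf(x):
              return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

          def char_fn(t):
              # E[exp(i t X)]: integrate real and imaginary parts separately.
              re, _ = quad(lambda x: normal_pdf(x) * np.cos(t * x), -np.inf, np.inf)
              im, _ = quad(lambda x: normal_pdf(x) * np.sin(t * x), -np.inf, np.inf)
              return re + 1j * im

          for t in [0.0, 0.5, 1.0, 2.0]:
              print(t, char_fn(t), np.exp(-t**2 / 2))  # the last two columns agree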

      • beagle3 3 hours ago

        “Characteristic function” is (was) an overloaded term.

        What you described is more often referred to as an “indicator function” these days, with “characteristic function” denoting the transform (Fourier, Laplace, Z, depending on context). It's closely related to the “moment-generating function”, to the point of being almost interchangeable.

        • ysofunny 25 minutes ago

          So the same thing, but the characteristic function as I knew it before these posts is a rudimentary 2-variable finite version: point and line (but the line is a curve, a circle, because e).

          But the new-and-improved 21st-century characteristic functions are n-variable and have a full continuous spectrum of values between zero (false) and one (true), though only potentially, lest the infinite realize itself (which would make the theories illogical).

          This way of thinking about it makes sense to me, even if it's ever so slightly wrong by some nitpickable point: https://en.wikipedia.org/wiki/Moment-generating_function

  • sampo 6 hours ago

    Herbert S. Wilf (1990): Generatingfunctionology. https://www2.math.upenn.edu/~wilf/gfology2.pdf

  • Sharlin 3 hours ago

    Probably worth noting that, as we now know, polynomials (over a field) are a vector space, not just convertible to one. The set of monomials { x^0, x^1, x^2, … } is an orthogonal basis.
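
    A tiny illustration of that structure, assuming NumPy (the coefficients are arbitrary examples): adding and scaling polynomials acts coordinate-wise on their coefficient vectors, which is exactly the vector-space structure.

      from numpy.polynomial import Polynomial

      p = Polynomial([6, 2, 8, 4])  # 6 + 2x + 8x^2 + 4x^3
      q = Polynomial([1, 0, 0, 1])  # 1 + x^3

      print((p + q).coef)  # [7. 2. 8. 5.] -- coefficient vectors add
      print((2 * p).coef)  # [12. 4. 16. 8.] -- and scale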

  • pedroth 6 hours ago

    "I have long struggled with understanding what probability-generating functions are and how to intuit them. There were two pieces of the puzzle missing for me, and we’ll go through both in this article."

    Great article. For more, I really recommend Analytic Combinatorics:

    https://ac.cs.princeton.edu/home/

    • hintymad 36 minutes ago

      Second this. This class is a classic example of a conceptual blockbuster. Once one learns it, the complexity analysis of algorithms will never be the same again. In general, if a techie wants to spend their spare time learning new stuff, they will be better off focusing on such conceptual material, as the return compounds over the years.

  • jldugger 2 hours ago

    I've always wondered why the hell generating functions existed, and I think this line sums it up:

    > When de Moivre invented much of modern probability in the mid-1700s, he didn’t have vectors! Vectors are an 1800s invention.

    Doesn't explain why we still teach them 300 years later, though. That's what the second half of the article covers.

  • gorgoiler 5 hours ago

    “If we want to encode the vector [6,2,8,4] in a single expression we can create a function containing those numbers: f(x) = 6 + 2x + 8x² + 4x³”

    …or if you flip the vector and use x=10:

      6284
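
    A quick plain-Python check of the trick (poly_eval is a hypothetical helper, not from the article): evaluating the flipped coefficient list at x=10 recovers the digits as a single number, because place-value notation is a polynomial in disguise.

      def poly_eval(coeffs, x):
          # Horner's rule for coeffs[0] + coeffs[1]*x + coeffs[2]*x^2 + ...
          result = 0
          for c in reversed(coeffs):
              result = result * x + c
          return result

      v = [6, 2, 8, 4]
      print(poly_eval(v[::-1], 10))  # 6284
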
    • Sharlin 3 hours ago

      Yes, but the polynomial form generalizes to coefficients in an arbitrary field, not just the naturals. If your vector were, say, [1.3, 2.197656, pi, -1/2, 3*2i] then there wouldn’t be a reasonable base you could pick for a place-value representation.

    • AgentMatt 4 hours ago

      Indeed. However, note that this is limited to encoding values between 0 and 9.

  • derefr 2 hours ago

    Is there a relationship between algebraic polynomial encoding of sequences, and https://en.wikipedia.org/wiki/G%C3%B6del_numbering_for_seque... ?

    Does an encoding of a sequence in a given Gödel numbering, also somehow "retrievably" encode the probability space of the sequence's terms?

  • shiandow 7 hours ago

    Probably worth mentioning the moment-generating function as well, since it's a bit more elementary than the characteristic function and shares many of its properties. It's also more simply related to the probability-generating function: you can go from one to the other with a basic change of coordinates (t -> log(x)). I'd also estimate that the moment-generating function is easier to calculate in most cases.

    In fact most properties of the PGF come from the moment-generating/characteristic function, including why the second derivative is related to the variance. The second derivative of the moment-generating function at zero is the second moment E[X^2]; the second derivative of the logarithm of the MGF at zero is the variance, essentially by definition.

    The one property that's somewhat unique to the PGF is how composition relates to drawing a randomly-sized sample, which I can see could be useful.
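
    A short symbolic sketch of that change of coordinates, assuming SymPy and using Poisson(lam) as the example distribution: substituting x -> exp(t) turns the PGF into the MGF, and the second derivative of log(MGF) at t = 0 gives the variance.

      import sympy as sp

      t, x, lam = sp.symbols("t x lam", positive=True)

      G = sp.exp(lam * (x - 1))  # PGF of a Poisson(lam) variable
      M = G.subs(x, sp.exp(t))   # MGF, via the substitution x -> e^t

      second_moment = sp.diff(M, t, 2).subs(t, 0)     # E[X^2] = lam**2 + lam
      variance = sp.diff(sp.log(M), t, 2).subs(t, 0)  # Var[X] = lam

      print(sp.simplify(second_moment), sp.simplify(variance))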

    • adiM 6 hours ago

      One can also think of probability-generating functions as (flipped) Z-transforms, moment-generating functions as (flipped) Laplace transforms, and characteristic functions as Fourier transforms of the respective PMF/PDF. A lot of their properties then follow from simple properties of signals and systems.

      • glial 4 hours ago

        Do you have a reference that explains this in more detail? I'd be curious to know.

        • adiM 2 hours ago

          Don't have a reference off the top of my head, but the main idea is as follows:

          The definition of MGF of a random variable with PDF f(x) is

          E[e^{sX}] = int_{-inf}^{inf} f(x) e^{sx} dx

          The definition of Laplace Transform of a signal f(t) is

          F(s) = int _{-inf}^{inf} f(t) e^{-st} dt

          Hence the MGF is a 'flipped' Laplace transform.

          Now, we know that the MGF of a sum of independent RVs is the product of their MGFs. So if we take the inverse Laplace transform, the density of the sum is the convolution of the individual densities.

          Similarly, if we take a derivative in the frequency domain, that is the same as multiplying in the time domain: so M'_X(s) is the 'flipped Laplace transform' of x f(x), and its value at s=0 is the 'DC gain' of the signal.

          And so on... the properties are all immediate consequences of the definition of the MGF, and since the definition is essentially the same as that of a Laplace transform, there is an equivalent property in signals and systems as well.
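
          A quick symbolic check of that derivative property, assuming SymPy and using an exponential(1) variable as the example:

            import sympy as sp

            x, s = sp.symbols("x s")

            # MGF of an exponential(1) variable (valid for s < 1).
            M = 1 / (1 - s)

            # M'(s) is the 'flipped Laplace transform' of x*f(x);
            # its value at s=0 should equal E[X], computed directly below.
            mean_via_mgf = sp.diff(M, s).subs(s, 0)
            mean_direct = sp.integrate(x * sp.exp(-x), (x, 0, sp.oo))

            print(mean_via_mgf, mean_direct)  # both print 1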

  • esafak 3 hours ago

    I rather like the derivation of the CLT using MGFs.

    https://courses.cs.washington.edu/courses/cse312/20su/files/...

  • JPC21 6 hours ago

    Generating functions are also a great tool in combinatorics (see for example the book Analytic Combinatorics by Flajolet and Sedgewick).

  • vector_spaces 5 hours ago

    I think the 12th order G(t) example is missing another term with coefficient 1/5 -- since these coefficients must sum to 1

    • stocknoob 5 hours ago

      Unless it’s already been fixed, the queen’s term (t^12) has 2/5 in the article.

  • jampekka 6 hours ago

    Lost the plot at "... and for sensible v, this is equivalent to ..." :(

    • Fellshard 6 hours ago

      For that, I would look at early calculus/pre-calc, I think, examining infinite series and their properties and equivalences.

      There are certain forms like that which have well-known values they converge to as you continue adding terms into infinity. Sometimes that convergence is only possible if the domain is limited, e.g. to [0,1).

      • jampekka 5 hours ago

        That clears it up, thanks! I hadn't figured out that "sensible" referred to a convergent G(t).

        • madcaptenor 5 hours ago

          It doesn't mean that in general; I think the author is saying "for sensible v" somewhat informally to mean "for v for which this makes sense".

        • foldU 5 hours ago

          You also don't really have to worry about convergence, since these are formal power series.

    • coliveira 4 hours ago

      This is just the sum of a geometric sequence.
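
      For reference, the identity in question (a worked version of the step, stated here as an assumption about what the article means by "sensible v"): the partial sums telescope, and for |v| < 1 the tail vanishes.

        % (1 - v) \sum_{k=0}^{n} v^k = 1 - v^{n+1}, so letting n -> oo:
        \sum_{k=0}^{\infty} v^k = \frac{1}{1 - v}, \qquad |v| < 1.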