For those interested in looking slightly more into the characteristic function, it may be worth pointing out that the characteristic function is equal to the Fourier transform (with the sign of the argument reversed) of the probability distribution in question.
In my own experience teaching probability theory to physicists and engineers, establishing this connection is often a good way of helping people build intuition for why characteristic functions are so useful, why they crop up everywhere in probability theory, and why we can extract so much useful information about a distribution by looking at the characteristic function (since this group of students tends to already be rather familiar with Fourier transforms).
Yes, this provides good intuition about why it is useful: the PDF of the sum of two random variables is the convolution of the original PDFs. A convolution is awkward to work with, but by the convolution theorem it is a multiplication in the Fourier domain. This immediately suggests that the Fourier transform of a PDF would be a useful thing to work with.
If you don't say that this is what you are doing then it all seems quite mysterious.
> the PDF of the sum of two random variables is the convolution of the original PDFs
(Probably obvious to everyone reading, but the variables should be independent.)
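For anyone who wants to see this concretely, here is a minimal sketch (my own example, not from the article) comparing the density of the sum of two independent normals against the numerical convolution of their individual densities, using numpy and scipy:

    import numpy as np
    from scipy import stats

    # Two independent random variables: X ~ N(0, 1) and Y ~ N(1, 2)
    x = np.linspace(-10, 10, 2001)
    dx = x[1] - x[0]
    f_X = stats.norm(0, 1).pdf(x)
    f_Y = stats.norm(1, 2).pdf(x)

    # Density of X + Y via numerical convolution of the two PDFs
    f_sum_conv = np.convolve(f_X, f_Y, mode="same") * dx

    # Closed form for comparison: X + Y ~ N(1, sqrt(5))
    f_sum_exact = stats.norm(1, np.sqrt(5)).pdf(x)

    print(np.max(np.abs(f_sum_conv - f_sum_exact)))  # tiny, up to grid/truncation error

(Doing the same thing by multiplying the FFTs of the two sampled densities gives the same answer, which is the convolution theorem at work.)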
As a physicist, the moment when everything just clicked was when I realised that connected Feynman diagrams were basically the cumulants of that distribution. Then almost everything in physics is about "what is the characteristic/moment/cumulant generating function?" and associated Legendre transforms
I feel like it's almost criminal of textbook writers not to mention this when introducing the characteristic function... At least as an aside or a footnote, for readers already familiar with Fourier transforms.
I had not made that connection and find that incredibly useful. Thank you for pointing that out.
but isn't a characteristic function just "the" way to bridge the gap between sets, functions, and logic(? ...a 3-way bridge!?)
I mean, it was useful for me to think of it as a translation from sets and logic (this variable x is in the set xor not) into functions (a function f(x) that returns 1 or true whenever x is in set S)
how the heck is that a fourier transform!??
You're thinking of a "characteristic function" in the sense of the "indicator function" of a subset (https://en.wikipedia.org/wiki/Indicator_function), which is a different thing from the characteristic function of a probability density function.
You can think of it like this:
- The characteristic function of a random variable X is defined as the function that maps t --> ExpectedValue[ exp( i * t * X ) ]
- Computing this expected value is the same as regarding t as a constant and integrating the function x --> exp( i * t * x) with respect to the distribution of X, i.e. if X has the density f, we compute the integral of f(x) * exp( i * t * x) with respect to x over the domain of f.
- On the other hand: computing the Fourier transform of f (here representing the density of X) and evaluating it at the point t (i.e. computing (F(f))(t) if F represents the Fourier transform) is the same as fixing t and computing the integral of f(x) * exp( -i * t * x) with respect to x.
- Rearranging the integrand in the previous expression to f(x) * exp( i * -t * x), we see that it is the same as the integrand used in the characteristic function, only with a -t instead of a t.
Hope that helps :)
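As a small numerical sanity check of that sign flip (a sketch using a normal distribution, whose characteristic function exp(i*t*mu - sigma^2*t^2/2) is known in closed form; the helper names are just illustrative):

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    mu, sigma = 2.0, 3.0
    f = stats.norm(mu, sigma).pdf  # density of X ~ N(mu, sigma)

    def char_fn(t):
        # E[exp(i*t*X)] = integral of f(x) * exp(i*t*x) dx
        re = quad(lambda x: f(x) * np.cos(t * x), -np.inf, np.inf)[0]
        im = quad(lambda x: f(x) * np.sin(t * x), -np.inf, np.inf)[0]
        return re + 1j * im

    def fourier(t):
        # Fourier transform convention: integral of f(x) * exp(-i*t*x) dx
        re = quad(lambda x: f(x) * np.cos(t * x), -np.inf, np.inf)[0]
        im = quad(lambda x: -f(x) * np.sin(t * x), -np.inf, np.inf)[0]
        return re + 1j * im

    t = 0.7
    print(char_fn(t))                                    # E[e^{itX}]
    print(fourier(-t))                                   # Fourier transform at -t: same number
    print(np.exp(1j * t * mu - 0.5 * sigma**2 * t**2))   # closed form, also the same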
“Characteristic function” is (was) an overloaded term.
What you described is more often referred to as an “indicator function” these days, with “characteristic functions” denoting the transform (Fourier, Laplace, Z, depending on context). Closely related to “moment generating functions”, to the point of being almost interchangeable.
So the same thing, but the characteristic function as I knew it before these posts is a rudimentary 2-variable finite version: point and line (but the line is a curve, a circle because e).
But the new and improved 21st century characteristic functions are n-variable and have a full continuous spectrum of values between zero (false) and one (true), though only potentially, lest the infinite realize itself (which would make the theories illogical).
this way of thinking about it makes sense to me, even if it's ever so slightly wrong by some nitpickable point https://en.wikipedia.org/wiki/Moment-generating_function
https://en.m.wikipedia.org/wiki/Characteristic_function_(pro...
Herbert S. Wilf (1990): Generatingfunctionology. https://www2.math.upenn.edu/~wilf/gfology2.pdf
Probably worth noting that, as we now know, polynomials (over a field) are a vector space, not just convertible to one. The set of monomials { x^0, x^1, x^2, … } is a basis (an orthonormal one, if you use the inner product that treats the coefficient sequences as coordinates).
"I have long struggled with understanding what probability-generating functions are and how to intuit them. There were two pieces of the puzzle missing for me, and we’ll go through both in this article."
Great article. For more, I really recommend Analytic Combinatorics:
https://ac.cs.princeton.edu/home/
Second this. This class is a classic example of a conceptual blockbuster. Once one learns it, the complexity analysis of algorithms will never be the same again. In general, if a techie wants to spend their spare time learning new stuff, they will be better off focusing more on such conceptual stuff, as the return will compound over the years.
I've always wondered why the hell generating functions existed, and I think this line sums it up:
> When de Moivre invented much of modern probability in the mid-1700s, he didn’t have vectors! Vectors are an 1800s invention.
Doesn't explain why we still teach them 300 years later though. That's what the second half of the article covers.
“If we want to encode the vector [6,2,8,4] in a single expression we can create a function containing those numbers: f(x) = 6 + 2x + 8x² + 4x³
…or if you flip the vector and use x=10:
Yes, but the polynomial form generalizes to coefficients of an arbitrary field, not just naturals. If your vector were, say, [1.3, 2.197656, pi, -1/2, 3*2i] then there wouldn’t be a reasonable base you could pick for a place-value representation.
Indeed. However, note that this is limited to encoding values between 0 and 9.
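A tiny sketch of that limitation (plain Python, names are just illustrative): evaluating the encoding polynomial at x = 10 only behaves like a place-value representation while every coefficient is a single digit, because larger coefficients "carry" into the next place.

    def encode(coeffs):
        # p(x) = coeffs[0] + coeffs[1]*x + coeffs[2]*x^2 + ...
        return lambda x: sum(c * x**k for k, c in enumerate(coeffs))

    p = encode([6, 2, 8, 4])
    print(p(10))   # 4826 -- digits (reversed) are recoverable, since all coefficients are 0..9

    q = encode([6, 12, 8, 4])
    print(q(10))   # 4926 -- the 12 has carried, so the coefficients can no longer be read off

With real or complex coefficients, as in the example a few comments up, no base makes the evaluation invertible at all, which is why keeping x as a formal symbol is the more general encoding.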
Is there a relationship between algebraic polynomial encoding of sequences, and https://en.wikipedia.org/wiki/G%C3%B6del_numbering_for_seque... ?
Does an encoding of a sequence in a given Gödel numbering, also somehow "retrievably" encode the probability space of the sequence's terms?
Probably worth mentioning the moment-generating function as well, since it's a bit more elementary than the characteristic function and shares many of the properties. It's also more simply related to the probability generating function, you can go from one to the other with a basic change of coordinates (t -> log(x)). I also estimate calculating the moment generating function to be easier in most cases.
In fact most properties of the PGF come from the moment-generating/characteristic function, including why the second derivative is related to the variance: the second derivative of the moment generating function at 0 is the second moment E[X^2], and the second derivative of the logarithm of the MGF at 0 is the variance (essentially by definition, since that is the second cumulant).
The one property that's somewhat unique to the PGF is how composition relates to drawing a randomly-sized sample, which I can see could be useful.
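A quick symbolic check of that change of coordinates and of the variance claim (a sketch; the Poisson distribution here is just a convenient concrete example of my choosing):

    import sympy as sp

    t, x = sp.symbols("t x")
    lam = sp.symbols("lam", positive=True)

    # Poisson(lam): PGF G(x) = E[x^X] = exp(lam*(x - 1))
    G = sp.exp(lam * (x - 1))

    # MGF via the substitution x -> e^t (equivalently t -> log x)
    M = G.subs(x, sp.exp(t))

    # Second derivative of the MGF at 0 is the raw second moment E[X^2] = lam + lam^2
    print(sp.simplify(sp.diff(M, t, 2).subs(t, 0)))          # lam**2 + lam

    # Second derivative of log(MGF) at 0 is the variance = lam
    print(sp.simplify(sp.diff(sp.log(M), t, 2).subs(t, 0)))  # lam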
One can also think of probability generating functions as (flipped) Z transforms, moment generating functions as (flipped Laplace transforms), and characteristic functions as Fourier transforms of the respective PMF/PDF. Lot of their properties then follow from simple properties of Signals and Systems.
Do you have a reference that explains this in more detail? I'd be curious to know.
Don't have a reference off the top of my head, but the main idea is as follows:
The definition of MGF of a random variable with PDF f(x) is
E[e^{sX}] = int_{-inf}^{inf} f(x) e^{sx} dx
The definition of Laplace Transform of a signal f(t) is
F(s) = int_{-inf}^{inf} f(t) e^{-st} dt
Hence the MGF is a 'flipped' (bilateral) Laplace transform: M_X(s) = F(-s).
Now, we know that the MGF of a sum of independent RVs is the product of their MGFs. So if we take the inverse Laplace transform, the density of the sum is the convolution of the individual densities.
Similarly, taking a derivative in the frequency domain is the same as multiplying in the time domain: so M'_X(s) is the 'flipped Laplace transform' of x f(x), and its value at s=0 is the 'DC gain' of that signal, i.e. E[X].
And so on... the properties are all immediate consequences of the definition of the MGF, and since that definition is essentially the same as that of a Laplace transform, there is an equivalent property in signals and systems as well.
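A small Monte Carlo sketch of two of those properties (the distributions here are just ones I picked for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Two independent RVs: X ~ Exponential(1) and Y ~ Uniform(0, 2)
    X = rng.exponential(1.0, size=1_000_000)
    Y = rng.uniform(0.0, 2.0, size=1_000_000)

    def mgf(samples, s):
        # Monte Carlo estimate of E[e^{sX}]
        return np.mean(np.exp(s * samples))

    s = 0.3
    print(mgf(X + Y, s))           # MGF of the sum ...
    print(mgf(X, s) * mgf(Y, s))   # ... approximately equals the product of the individual MGFs

    # Derivative at s = 0 is the 'DC gain' of x*f(x), i.e. the mean:
    h = 1e-4
    print((mgf(X, h) - mgf(X, -h)) / (2 * h))   # ~ 1.0
    print(X.mean())                             # ~ 1.0 as well (E[X] for Exponential(1))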
I rather like the derivation of the CLT using MGFs.
https://courses.cs.washington.edu/courses/cse312/20su/files/...
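For readers who don't want to click through, a compressed version of the usual MGF argument (my own sketch, not necessarily the exact presentation in those notes; it assumes the MGF exists near 0 and the X_i are standardized to mean 0 and variance 1), in LaTeX notation:

    % S_n = (X_1 + ... + X_n)/\sqrt{n} for i.i.d. X_i with mean 0, variance 1, MGF M.
    % By independence,
    M_{S_n}(t) = M\left(\tfrac{t}{\sqrt{n}}\right)^{n}
    % Taking logs and expanding the cumulant generating function around 0,
    % where \log M(u) = \tfrac{u^2}{2} + O(u^3) (mean 0, variance 1):
    \log M_{S_n}(t) = n \log M\left(\tfrac{t}{\sqrt{n}}\right)
                    = \tfrac{t^2}{2} + O\left(n^{-1/2}\right) \to \tfrac{t^2}{2}
    % which is the log-MGF of the standard normal -- hence the CLT.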
Generating functions are also a great tool in combinatorics (see for example the book Analytic Combinatorics by Flajolet and Sedgewick).
Tangentially related: http://bactra.org/notebooks/cumulants.html
I think the 12th order G(t) example is missing another term with coefficient 1/5 -- since these coefficients must sum to 1
Unless it’s already been fixed, the queen’s term (t^12) has 2/5 in the article.
Lost the plot at "... and for sensible v, this is equivalent to ..." :(
For that, I would look at early calculus/pre-calc, I think, examining infinite series and their properties and equivalencies.
There are certain forms like that which have well-known values they converge to as you continue adding terms out to infinity. Sometimes that convergence is only possible if your domain is limited, e.g. [0,1].
That clears it, thanks! I didn't figure out that "sensible" referred to convergent G(t).
It doesn't mean that in general; I think the author is saying "for sensible v" somewhat informally to mean "for v for which this makes sense".
You also don't really have to worry about convergence, since these are formal power series.
This is just the sum of a geometric sequence.
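For concreteness, the identity presumably being leaned on (assuming the article's v plays the role of the ratio in a geometric series; LaTeX notation):

    \sum_{k=0}^{\infty} v^k = \frac{1}{1 - v} \quad \text{for } |v| < 1

For a PGF the coefficients are probabilities summing to 1, which is why the series is also well behaved at v = 1.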