1. Statistics And Probability - Moment Generating Functions
This entry builds intuition and detailed derivations for Moment Generating Functions (MGFs). Because so many statistical derivations rely on MGFs, we present the concepts and steps in an explicit, accessible way, so that we can return to this reference whenever needed and quickly recall the essential techniques.
1.1 What is an MGF?
Let \(X\) be a random variable. The moment generating function is defined as
\[
M_X(t) = \mathbb{E}\left[e^{tX}\right],
\]
for all \(t\) where the expectation is finite (typically in a neighborhood of \(t=0\)).
Why this definition?
Because if you expand \(e^{tX}\) into a series, you will see the moments of \(X\) appear one by one. This property makes MGFs a powerful tool for deriving and manipulating moments without computing integrals or sums directly. By taking derivatives of \(M_X(t)\) at \(t=0\), you can extract any moment of the distribution.
1.2 How moments appear inside the MGF
Start with the Taylor expansion of \(e^{tX}\):
\begin{equation}\label{eq:mgf_taylor_expansion}
e^{tX} = \sum_{n=0}^{\infty} \frac{(tX)^n}{n!} = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots
\end{equation}
How and Why of \(\ref{eq:mgf_taylor_expansion}\)
The Taylor Series Foundation:
Here \(X\) is a scalar random variable, and we use the standard Taylor series for the exponential function:
\[
e^{u} = \sum_{n=0}^{\infty} \frac{u^n}{n!} = 1 + u + \frac{u^2}{2!} + \frac{u^3}{3!} + \cdots
\]
Applying it to \(e^{tX}\):
When we set \(u = tX\), we get:
\[
e^{tX} = \sum_{n=0}^{\infty} \frac{(tX)^n}{n!} = \sum_{n=0}^{\infty} \frac{t^n X^n}{n!}.
\]
- When \(t=0\): \(\exp(0) = 1\)
- When \(X\) is a random variable, each term \(X^n\) is also a random variable
- Taking expectations term-by-term (assuming we can swap \(\mathbb{E}\) and \(\sum\)) gives us the moments of \(X\)
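The term-by-term expectation can be checked numerically. Below is a sketch in Python using a Bernoulli(\(p\)) variable as an illustrative choice (it is treated in detail in Section 1.3.1): since \(X^n = X\) for \(X\in\{0,1\}\), we have \(\mathbb{E}[X^n]=p\) for every \(n\ge 1\), so the truncated series \(\sum_n \frac{t^n}{n!}\mathbb{E}[X^n]\) should match the closed-form MGF \((1-p)+pe^t\).

```python
import math

def mgf_series(t, p, n_terms=30):
    """Truncated series sum_{n=0}^{N} t^n E[X^n] / n! for X ~ Bernoulli(p).

    For Bernoulli, E[X^0] = 1 and E[X^n] = p for every n >= 1.
    """
    total = 1.0  # n = 0 term
    for n in range(1, n_terms):
        total += (t ** n) / math.factorial(n) * p
    return total

def mgf_closed_form(t, p):
    """Direct formula M_X(t) = (1 - p) + p * e^t."""
    return (1 - p) + p * math.exp(t)

p, t = 0.3, 0.7
print(abs(mgf_series(t, p) - mgf_closed_form(t, p)) < 1e-12)  # True
```

Thirty series terms are far more than enough for moderate \(t\); the agreement is at floating-point precision.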
Take expectation on both sides:
Jensen could be really mad if you forget that ...
... the swap of \(\mathbb{E}\) and \(\sum\) in \(\ref{eq:mgf_series_moments}\) is justified by the linearity of expectation, applied term by term. Expectation does not commute with non-linear functions: in general \(\mathbb{E}[e^{tX}] \neq e^{t\,\mathbb{E}[X]}\); in fact, since \(e^{tx}\) is convex in \(x\), Jensen's inequality gives \(\mathbb{E}[e^{tX}] \ge e^{t\,\mathbb{E}[X]}\).
That means the first derivative at \(t=0\) gives the mean, while the second derivative at \(t=0\) gives the second moment, and the same pattern continues for higher moments.
Formally,
\begin{equation}\label{eq:mgf_derivatives_moments}
\mathbb{E}[X^n] = \left.\frac{d^n}{dt^n} M_X(t)\right|_{t=0} = M_X^{(n)}(0).
\end{equation}
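A minimal numerical sanity check of this principle (a sketch; the degenerate variable \(X = c\) is an illustrative choice, for which \(M_X(t) = e^{ct}\) and \(\mathbb{E}[X^n] = c^n\)): central finite differences of the MGF at \(t=0\) recover the first two moments.

```python
import math

def mgf(t, c=2.0):
    # Degenerate X = c: M_X(t) = E[e^{tX}] = e^{ct}
    return math.exp(c * t)

def first_derivative(f, t=0.0, h=1e-5):
    # Central-difference approximation of f'(t)
    return (f(t + h) - f(t - h)) / (2 * h)

def second_derivative(f, t=0.0, h=1e-4):
    # Central-difference approximation of f''(t)
    return (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2

c = 2.0
m1 = first_derivative(mgf)   # should approximate E[X]   = c   = 2
m2 = second_derivative(mgf)  # should approximate E[X^2] = c^2 = 4
print(abs(m1 - 2.0) < 1e-6, abs(m2 - 4.0) < 1e-3)  # True True
```

Finite differences are only an approximation; for exact work one would differentiate symbolically, but the numerical check makes the "derivatives at zero are moments" idea tangible.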
1.3 Examples
The expected value \(\mathbb{E}[X]\) is defined differently depending on whether \(X\) is discrete or continuous, but the concept is the same: a weighted average of all possible values.
For discrete \(X\):
\[
\mathbb{E}[X] = \sum_{k} k\,\mathbb{P}(X=k).
\]
Each possible value \(k\) is weighted by its probability \(\mathbb{P}(X=k)\). The sum accumulates these weighted contributions.
For continuous \(X\):
\[
\mathbb{E}[X] = \int_{-\infty}^{\infty} x\, f_X(x)\, dx.
\]
Each infinitesimal value \(x\) is weighted by its probability density \(f_X(x)\). The integral accumulates these weighted contributions over all real numbers.
The connection: Both operations, the sum and the integral, are computing a weighted average. The sum accumulates discrete outcomes; the integral accumulates infinitely many continuous outcomes. When you see \(\mathbb{E}[g(X)]\), apply the same pattern: evaluate the function \(g\) at each outcome, weight it by probability (or density), then sum or integrate.
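Here is a concrete sketch of both patterns in Python (the fair die and the Uniform(0, 1) density are illustrative choices, not from the text): the discrete case is a literal weighted sum, and the continuous case is approximated with a simple midpoint Riemann sum.

```python
# Discrete: fair six-sided die, E[X] = sum_k k * P(X = k)
pmf = {k: 1 / 6 for k in range(1, 7)}
mean_discrete = sum(k * p for k, p in pmf.items())

# Continuous: Uniform(0, 1), f_X(x) = 1 on [0, 1], E[X] = integral of x * f(x) dx.
# Approximated with a midpoint Riemann sum over n slices of width dx.
n = 100_000
dx = 1 / n
mean_continuous = sum((i + 0.5) * dx * 1.0 * dx for i in range(n))

print(round(mean_discrete, 6), round(mean_continuous, 6))  # 3.5 0.5
```

The structure of both computations is identical: outcome times weight, accumulated.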
1.3.1 Bernoulli(\(p\))
Let \(X\in\{0,1\}\) with \(\mathbb{P}(X=1)=p\) and \(\mathbb{P}(X=0)=1-p\).
Compute the MGF directly from the definition:
\[
M_X(t) = \mathbb{E}\left[e^{tX}\right] = e^{t\cdot 0}\,(1-p) + e^{t\cdot 1}\,p = (1-p) + p e^t.
\]
Why this formula for \(\mathbb{E}[e^{tX}]\)?
For a discrete random variable, the expected value is computed by summing the function values weighted by their probabilities. In general, \(\mathbb{E}[g(X)]=\sum_{k} g(k)\mathbb{P}(X=k)\). Here \(g(X)=e^{tX}\), and \(X\) takes only two values: \(0\) and \(1\). So we get
\[
\mathbb{E}\left[e^{tX}\right] = e^{t\cdot 0}\,\mathbb{P}(X=0) + e^{t\cdot 1}\,\mathbb{P}(X=1) = (1-p) + p e^t.
\]
This is the definition of expected value for discrete distributions applied to the Bernoulli case.
Check the mean from the derivative:
\[
M_X'(t) = p e^t, \qquad M_X'(0) = p.
\]
Why does \(M_X'(0)\) equal the mean?
Recall from \(\ref{eq:mgf_series_moments}\) that the MGF can be written as
\[
M_X(t) = 1 + t\,\mathbb{E}[X] + \frac{t^2}{2!}\,\mathbb{E}[X^2] + \frac{t^3}{3!}\,\mathbb{E}[X^3] + \cdots
\]
When we take the derivative with respect to \(t\), we get
\[
M_X'(t) = \mathbb{E}[X] + t\,\mathbb{E}[X^2] + \frac{t^2}{2!}\,\mathbb{E}[X^3] + \cdots
\]
Evaluating at \(t=0\) makes all terms with \(t\) vanish, leaving only \(M_X'(0)=\mathbb{E}[X]\), the first moment (mean).
For the Bernoulli case specifically, we can verify this directly: \(\mathbb{E}[X]=0\cdot(1-p)+1\cdot p=p\), which matches \(M_X'(0)=p\).
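The Bernoulli derivation can also be checked by simulation. This is a sketch (the seed, sample size, and the evaluation point \(t=0.5\) are arbitrary choices): the sample average of \(e^{tX}\) over simulated Bernoulli draws should approach the closed form \((1-p)+pe^t\), and \(M_X'(0)=pe^0\) gives back \(p\).

```python
import math
import random

random.seed(0)

p, t = 0.3, 0.5
n = 200_000

# Empirical MGF: average e^{tX} over simulated Bernoulli(p) draws
draws = [1 if random.random() < p else 0 for _ in range(n)]
empirical = sum(math.exp(t * x) for x in draws) / n

exact = (1 - p) + p * math.exp(t)   # M_X(t) from the derivation above
deriv_at_zero = p * math.exp(0)     # M_X'(0) = p * e^0 = p

print(abs(empirical - exact) < 0.01, deriv_at_zero == p)
```

With 200,000 draws the Monte Carlo error is far below the 0.01 tolerance, so the comparison is reliable despite the randomness.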
1.3.2 Normal \(N(\mu,\sigma^2)\)
Let \(X\sim N(\mu,\sigma^2)\). We compute
\[
M_X(t) = \mathbb{E}\left[e^{tX}\right] = \int_{-\infty}^{\infty} e^{tx}\,\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx.
\]
Why this integral for \(\mathbb{E}[e^{tX}]\)?
For a continuous random variable with probability density function (PDF) \(f_X(x)\), the expected value of any function \(g(X)\) is computed by integrating the function times the PDF:
\[
\mathbb{E}[g(X)] = \int_{-\infty}^{\infty} g(x)\, f_X(x)\, dx.
\]
In our case, \(g(X)=e^{tX}\) and the PDF of \(X\sim N(\mu,\sigma^2)\) is
\[
f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).
\]
So the MGF becomes the integral shown below: we multiply \(e^{tx}\) by the Gaussian PDF and integrate over all \(x\).
Combine the exponent terms and complete the square:
\[
tx - \frac{(x-\mu)^2}{2\sigma^2} = -\frac{\left(x-(\mu+\sigma^2 t)\right)^2}{2\sigma^2} + \mu t + \frac{1}{2}\sigma^2 t^2.
\]
Now the integral becomes the integral of a normal PDF (with mean \(\mu+\sigma^2 t\) and variance \(\sigma^2\)), which equals 1. So,
\[
M_X(t) = \exp\!\left(\mu t + \frac{1}{2}\sigma^2 t^2\right).
\]
Extracting the mean from \(M_X(t)\)
We have \(M_X(t)=\exp\left(\mu t+\frac{1}{2}\sigma^2 t^2\right)\). To find the mean, take the derivative:
\[
M_X'(t) = \left(\mu + \sigma^2 t\right)\exp\!\left(\mu t + \frac{1}{2}\sigma^2 t^2\right).
\]
Evaluating at \(t=0\) gives
\[
M_X'(0) = \mu.
\]
This confirms that the first derivative at \(t=0\) recovers the mean parameter \(\mu\) of the normal distribution. This follows the general principle from \(\ref{eq:mgf_derivatives_moments}\): the first derivative at zero always gives \(\mathbb{E}[X]\).
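The same finite-difference idea verifies the normal MGF numerically (a sketch with illustrative parameters \(\mu=1.5\), \(\sigma=2\)): the first derivative of \(M_X(t)=\exp(\mu t+\tfrac{1}{2}\sigma^2 t^2)\) at \(0\) should return \(\mu\), and the second derivative should return \(\mathbb{E}[X^2]=\sigma^2+\mu^2\).

```python
import math

mu, sigma = 1.5, 2.0

def mgf_normal(t):
    # M_X(t) = exp(mu * t + sigma^2 * t^2 / 2) for X ~ N(mu, sigma^2)
    return math.exp(mu * t + 0.5 * sigma ** 2 * t ** 2)

h = 1e-5
# Central differences at t = 0
first = (mgf_normal(h) - mgf_normal(-h)) / (2 * h)
second = (mgf_normal(h) - 2 * mgf_normal(0) + mgf_normal(-h)) / h ** 2

print(abs(first - mu) < 1e-6)                          # True: M'(0)  = mu
print(abs(second - (sigma ** 2 + mu ** 2)) < 1e-3)     # True: M''(0) = sigma^2 + mu^2
```

Note that \(M_X''(0)\) is the second moment, so the variance is recovered as \(M_X''(0)-M_X'(0)^2=\sigma^2\).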
1.4 Sum of independent variables
If \(X\) and \(Y\) are independent, then
\[
M_{X+Y}(t) = \mathbb{E}\left[e^{t(X+Y)}\right] = \mathbb{E}\left[e^{tX}\right]\mathbb{E}\left[e^{tY}\right] = M_X(t)\,M_Y(t).
\]
This is one of the most useful properties of MGFs. It lets you find the distribution of sums by multiplying MGFs.
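As a small check of this property (a sketch; the two independent normals and their parameters are illustrative choices): multiplying the MGFs of \(N(\mu_1,\sigma_1^2)\) and \(N(\mu_2,\sigma_2^2)\) should give exactly the MGF of \(N(\mu_1+\mu_2,\ \sigma_1^2+\sigma_2^2)\), i.e. the sum of independent normals is normal.

```python
import math

def mgf_normal(t, mu, sigma2):
    # MGF of N(mu, sigma2): exp(mu * t + sigma2 * t^2 / 2)
    return math.exp(mu * t + 0.5 * sigma2 * t ** 2)

mu1, s1 = 0.5, 1.0   # X ~ N(0.5, 1.0)
mu2, s2 = -1.0, 2.0  # Y ~ N(-1.0, 2.0), independent of X

# M_{X+Y}(t) = M_X(t) * M_Y(t) should match the MGF of N(mu1 + mu2, s1 + s2)
ok = all(
    math.isclose(
        mgf_normal(t, mu1, s1) * mgf_normal(t, mu2, s2),
        mgf_normal(t, mu1 + mu2, s1 + s2),
    )
    for t in (-1.0, -0.3, 0.0, 0.4, 1.2)
)
print(ok)  # True
```

Matching MGFs in a neighborhood of zero identifies the distribution, which is exactly the uniqueness property discussed in the next section.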
1.5 Why MGFs are useful
MGFs encode all moments in a single function, which makes them a compact summary of the distribution. They also make sums easy to handle in the independent case, because multiplying MGFs corresponds to adding random variables. Finally, they can identify distributions: if two variables share the same MGF in a neighborhood of zero, they have the same distribution.
1.6 Quick checklist for practice
Try to compute \(M_X(t)\) for a discrete variable using a sum and for a continuous variable using an integral, and then recover the mean and variance by differentiating \(M_X(t)\) at \(t=0\). If you can do those three tasks, the core mechanics of MGFs will feel natural.
Building confidence through rigorous validation