1. Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) [1][2] is a fundamental statistical method for estimating the parameters of a probability distribution or statistical model. Introduced by Ronald Fisher in the 1920s, MLE provides a principled and widely applicable framework for parameter estimation with desirable theoretical properties.
The core principle is elegant: choose parameter values that make the observed data most probable.
1.1 The Likelihood Function
1.1.1 Definition
Suppose we have observed data \(\mathbf{x} = (x_1, x_2, \ldots, x_n)\) that are assumed to be realizations from a probability distribution with probability density function (or mass function) \(f(x; \theta)\), where \(\theta \in \Theta\) is an unknown parameter (or parameter vector).
The likelihood function is defined as:
\begin{equation}
L(\theta; \mathbf{x}) = \prod_{i=1}^{n} f(x_i; \theta),
\label{eq:likelihood_function}
\end{equation}
assuming the observations are independent and identically distributed (i.i.d.).
Understanding the Likelihood Function
Key conceptual shift: While \(f(x; \theta)\) is a probability density as a function of \(x\) for fixed \(\theta\), the likelihood \(L(\theta; \mathbf{x})\) is a function of \(\theta\) for fixed observed data \(\mathbf{x}\).
- Probability perspective: \(f(x; \theta)\) tells us how probable different data values are for a given parameter
- Likelihood perspective: \(L(\theta; \mathbf{x})\) tells us how plausible different parameter values are given the observed data
Important: The likelihood is not a probability distribution over \(\theta\). It does not integrate to 1, and \(\theta\) is treated as an unknown constant, not a random variable.
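To make the two perspectives concrete, here is a minimal sketch in plain Python, using a small assumed Bernoulli data set (7 successes in 10 trials): the data stay fixed while the parameter \(p\) varies.

```python
# Likelihood as a function of the parameter, for fixed observed data.
# The data set below is an assumed toy example, not from the text.
data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

def likelihood(p, xs):
    """L(p; x) = prod_i p^{x_i} (1 - p)^{1 - x_i} for i.i.d. Bernoulli data."""
    result = 1.0
    for x in xs:
        result *= p if x == 1 else (1.0 - p)
    return result

# Same data, different candidate parameter values:
for p in [0.3, 0.5, 0.7, 0.9]:
    print(f"L({p}; x) = {likelihood(p, data):.6f}")
```

Note that the printed values do not sum or integrate to 1 over \(p\), echoing the point above: the likelihood is not a probability distribution over the parameter. The largest value occurs at \(p = 0.7\), the sample proportion.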
1.1.2 Properties of the Likelihood Function
Non-negativity: \(L(\theta; \mathbf{x}) \geq 0\) for all \(\theta \in \Theta\), since it is a product of densities (or probabilities).
Scale invariance: The likelihood can be multiplied by any positive constant without changing the location of its maximum. This motivates working with the log-likelihood.
1.2 The Log-Likelihood Function
1.2.1 Definition
The log-likelihood function is the natural logarithm of the likelihood function:
\begin{equation}
\ell(\theta; \mathbf{x}) = \ln L(\theta; \mathbf{x}) = \sum_{i=1}^{n} \ln f(x_i; \theta)
\label{eq:log_likelihood}
\end{equation}
1.2.2 Why Use the Log-Likelihood?
There are several compelling reasons to work with \(\ell(\theta)\) instead of \(L(\theta)\):
- Computational stability: Products of many small probabilities can cause numerical underflow; sums of logarithms are far more stable
- Mathematical convenience: The product in \(\ref{eq:likelihood_function}\) becomes a sum in \(\ref{eq:log_likelihood}\), simplifying differentiation
- Preservation of maxima: Since \(\ln(\cdot)\) is a strictly increasing function,
  \[
  \arg\max_{\theta} L(\theta; \mathbf{x}) = \arg\max_{\theta} \ell(\theta; \mathbf{x})
  \]
- Asymptotic theory: Many theoretical results are more naturally expressed in terms of the log-likelihood
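The computational-stability point can be demonstrated in a few lines. In this sketch, 0.4 is an assumed stand-in for each density value \(f(x_i; \theta)\); a product of 2000 such values underflows to exactly zero in double precision, while the equivalent sum of logs is well behaved.

```python
import math

# 2000 identical density values, each below 1 (assumed stand-ins for f(x_i; theta))
densities = [0.4] * 2000

# Naive product: underflows to exactly 0.0 in IEEE double precision
product = 1.0
for d in densities:
    product *= d
print(product)  # 0.0: the likelihood has underflowed

# Log-likelihood: a sum of logs, numerically stable
log_likelihood = sum(math.log(d) for d in densities)
print(log_likelihood)  # about -1832.58, no numerical trouble
```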
1.3 The Maximum Likelihood Estimator
1.3.1 Definition
The maximum likelihood estimator (MLE) \(\hat{\theta}_{MLE}\) is the value of \(\theta\) that maximizes the likelihood (or equivalently, the log-likelihood):
\[
\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} L(\theta; \mathbf{x}) = \arg\max_{\theta \in \Theta} \ell(\theta; \mathbf{x})
\]
1.3.2 Finding the MLE
For differentiable likelihoods, the MLE can often be found by solving the score equations:
\[
\frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta} = 0
\]
The derivative \(\frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta}\) is called the score function.
Derivation of First-Order Condition
For a scalar parameter \(\theta\), the first-order necessary condition for an interior maximum is:
\[
\frac{\partial L(\theta; \mathbf{x})}{\partial \theta} = 0
\]
Using the chain rule:
\[
\frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta} = \frac{\partial}{\partial \theta} \ln L(\theta; \mathbf{x}) = \frac{1}{L(\theta; \mathbf{x})} \frac{\partial L(\theta; \mathbf{x})}{\partial \theta}
\]
Therefore, since \(L(\theta; \mathbf{x}) > 0\), the score equation becomes:
\[
\frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta} = \sum_{i=1}^{n} \frac{\partial \ln f(x_i; \theta)}{\partial \theta} = 0
\]
For vector parameters \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)^\top\), we have a system of \(k\) equations:
\[
\frac{\partial \ell(\boldsymbol{\theta}; \mathbf{x})}{\partial \theta_j} = 0, \qquad j = 1, \ldots, k,
\]
or in vector notation:
\[
\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}; \mathbf{x}) = \mathbf{0}
\]
1.3.3 Second-Order Condition
To verify that a critical point is indeed a maximum, check the second-order condition:
For scalar \(\theta\):
\[
\frac{\partial^2 \ell(\theta; \mathbf{x})}{\partial \theta^2}\bigg|_{\theta = \hat{\theta}} < 0
\]
For vector \(\boldsymbol{\theta}\), the Hessian matrix must be negative definite:
\[
H(\hat{\boldsymbol{\theta}}) = \frac{\partial^2 \ell(\boldsymbol{\theta}; \mathbf{x})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^\top}\bigg|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} \prec 0
\]
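Both conditions can be checked numerically. The sketch below uses an assumed exponential toy sample (rate parameter `lam`, log-likelihood \(\ell(\lambda) = n\ln\lambda - \lambda\sum x_i\)) and central finite differences to confirm that the score vanishes and the curvature is negative at the well-known closed-form MLE \(\hat{\lambda} = n/\sum x_i\):

```python
import math

# Assumed exponential toy data
data = [0.8, 1.3, 0.4, 2.1, 0.9, 1.7]

def loglik(lam):
    # l(lam) = n * ln(lam) - lam * sum(x_i)
    return len(data) * math.log(lam) - lam * sum(data)

# Closed-form MLE from the score equation: lam_hat = n / sum(x_i)
lam_hat = len(data) / sum(data)

# Central finite differences for l'(lam) and l''(lam)
h = 1e-5
score = (loglik(lam_hat + h) - loglik(lam_hat - h)) / (2 * h)
curvature = (loglik(lam_hat + h) - 2 * loglik(lam_hat) + loglik(lam_hat - h)) / h ** 2

print(score)      # approximately 0: first-order condition holds
print(curvature)  # negative: second-order condition holds
```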
1.4 Fundamental Properties of MLEs
1.4.1 1. Consistency
Under regularity conditions, the MLE is consistent: as the sample size \(n \to \infty\), the MLE converges in probability to the true parameter value:
\[
\hat{\theta}_{MLE} \xrightarrow{p} \theta_0,
\]
where \(\theta_0\) is the true parameter value.
1.4.2 2. Asymptotic Normality
Under regularity conditions, the MLE has an asymptotically normal distribution:
\[
\sqrt{n}\left(\hat{\theta}_{MLE} - \theta_0\right) \xrightarrow{d} \mathcal{N}\!\left(0,\, I(\theta_0)^{-1}\right),
\]
where \(I(\theta_0)\) is the Fisher information:
\[
I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial \ln f(X; \theta)}{\partial \theta}\right)^{\!2}\right]
\]
Alternative Expression for Fisher Information
The Fisher information can also be expressed as:
\[
I(\theta) = -\mathbb{E}\!\left[\frac{\partial^2 \ln f(X; \theta)}{\partial \theta^2}\right]
\]
Proof of equivalence (for scalar \(\theta\)):
Starting with the fact that \(\int f(x; \theta)\,dx = 1\) for all \(\theta\), differentiate both sides with respect to \(\theta\):
\[
\int \frac{\partial f(x; \theta)}{\partial \theta}\,dx = 0
\]
Multiply and divide by \(f(x; \theta)\):
\[
\int \frac{\partial \ln f(x; \theta)}{\partial \theta}\, f(x; \theta)\,dx = 0
\]
Differentiating once more with respect to \(\theta\):
\[
\int \frac{\partial}{\partial \theta}\!\left[\frac{\partial \ln f(x; \theta)}{\partial \theta}\, f(x; \theta)\right] dx = 0
\]
Applying the product rule (and using \(\frac{\partial f}{\partial \theta} = \frac{\partial \ln f}{\partial \theta}\, f\) in the second term):
\[
\int \frac{\partial^2 \ln f(x; \theta)}{\partial \theta^2}\, f(x; \theta)\,dx + \int \left(\frac{\partial \ln f(x; \theta)}{\partial \theta}\right)^{\!2} f(x; \theta)\,dx = 0
\]
Taking expectations:
\[
\mathbb{E}\!\left[\frac{\partial^2 \ln f(X; \theta)}{\partial \theta^2}\right] + \mathbb{E}\!\left[\left(\frac{\partial \ln f(X; \theta)}{\partial \theta}\right)^{\!2}\right] = 0
\]
Rearranging:
\[
I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial \ln f(X; \theta)}{\partial \theta}\right)^{\!2}\right] = -\mathbb{E}\!\left[\frac{\partial^2 \ln f(X; \theta)}{\partial \theta^2}\right]
\]
1.4.3 3. Efficiency (Cramér-Rao Lower Bound)
Under regularity conditions, the MLE achieves the Cramér-Rao lower bound asymptotically, meaning it has the smallest possible asymptotic variance among consistent estimators:
\[
\operatorname{Var}(\hat{\theta}_{MLE}) \approx \frac{1}{n\, I(\theta_0)} \quad \text{for large } n
\]
This makes the MLE asymptotically efficient.
1.4.4 4. Invariance Property
If \(\hat{\theta}_{MLE}\) is the MLE of \(\theta\), and \(g(\cdot)\) is any function, then the MLE of \(\tau = g(\theta)\) is:
\[
\hat{\tau}_{MLE} = g\!\left(\hat{\theta}_{MLE}\right)
\]
This property is extremely useful in practice, as it allows us to obtain MLEs of transformed parameters directly.
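As a sketch of the invariance property, consider Bernoulli data and the odds \(g(p) = p/(1-p)\) (assumed toy data, 7 successes in 10 trials). Maximizing the likelihood directly in the odds parameterization over a fine grid recovers the same answer as plugging \(\hat{p}\) into \(g\):

```python
import math

# Assumed toy data: 7 successes in 10 Bernoulli trials
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
k, n = sum(data), len(data)

p_hat = k / n                           # MLE of p
odds_via_invariance = p_hat / (1 - p_hat)

def loglik_odds(w):
    # Reparameterize: p = w / (1 + w), so l(w) = k*ln(p) + (n-k)*ln(1-p)
    p = w / (1 + w)
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Maximize the odds parameterization directly on a fine grid
grid = [i / 1000 for i in range(1, 10000)]   # w in (0.001, 9.999)
odds_direct = max(grid, key=loglik_odds)

print(odds_via_invariance)  # 7/3, about 2.3333
print(odds_direct)          # same maximizer, up to grid resolution
```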
1.5 Standard Errors and Confidence Intervals
1.5.1 Observed Fisher Information
In practice, we estimate the variance of the MLE using the observed Fisher information:
\[
J(\hat{\theta}) = -\frac{\partial^2 \ell(\theta; \mathbf{x})}{\partial \theta^2}\bigg|_{\theta = \hat{\theta}}
\]
1.5.2 Standard Error
The standard error of the MLE is:
\[
\widehat{SE}(\hat{\theta}_{MLE}) = \frac{1}{\sqrt{J(\hat{\theta}_{MLE})}}
\]
For vector parameters:
\[
\widehat{SE}(\hat{\theta}_j) = \sqrt{\left[J(\hat{\boldsymbol{\theta}}_{MLE})^{-1}\right]_{jj}}
\]
1.5.3 Asymptotic Confidence Intervals
An approximate \((1-\alpha)\) confidence interval for \(\theta\) is:
\[
\hat{\theta}_{MLE} \pm z_{\alpha/2}\, \widehat{SE}(\hat{\theta}_{MLE}),
\]
where \(z_{\alpha/2}\) is the \((1-\alpha/2)\) quantile of the standard normal distribution.
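The recipe above can be sketched for a Bernoulli parameter with assumed data (7 successes in 10 trials), where the observed information works out to \(J(\hat{p}) = n / (\hat{p}(1-\hat{p}))\) and hence \(\widehat{SE} = \sqrt{\hat{p}(1-\hat{p})/n}\):

```python
import math

# Assumed data: 7 successes in 10 Bernoulli trials
k, n = 7, 10
p_hat = k / n                               # MLE of p

se = math.sqrt(p_hat * (1 - p_hat) / n)     # 1 / sqrt(J(p_hat))
z = 1.96                                    # z_{0.025}: 97.5% standard normal quantile

lo, hi = p_hat - z * se, p_hat + z * se
print(round(lo, 3), round(hi, 3))           # approximate 95% CI: 0.416 0.984
```

The wide interval reflects the small sample size; Wald intervals like this rely on the asymptotic normality of the MLE and can be inaccurate for small \(n\).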
1.6 Worked Example: Normal Distribution
Let's derive the MLE for the mean \(\mu\) and variance \(\sigma^2\) of a normal distribution, given i.i.d. observations \(x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)\).
1.6.1 Step 1: Write the Likelihood
The probability density function is:
\[
f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\]
The likelihood function is:
\[
L(\mu, \sigma^2; \mathbf{x}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right)
\]
1.6.2 Step 2: Take the Log-Likelihood
\[
\ell(\mu, \sigma^2; \mathbf{x}) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2
\]
1.6.3 Step 3: Find the Score Equations
For \(\mu\):
\begin{equation}
\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)
\label{eq:normal_score_mu}
\end{equation}
Setting \(\ref{eq:normal_score_mu}\) equal to zero:
\[
\sum_{i=1}^{n} (x_i - \hat{\mu}) = 0 \quad \Longrightarrow \quad \hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}
\]
For \(\sigma^2\):
\begin{equation}
\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2
\label{eq:normal_score_sigma2}
\end{equation}
Setting \(\ref{eq:normal_score_sigma2}\) equal to zero and substituting \(\hat{\mu}_{MLE}\):
\[
-\frac{n}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4} \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0
\]
Solving for \(\hat{\sigma}^2\):
\[
\hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
1.6.4 Step 4: Verify Second-Order Conditions
The Hessian of \(\ell\) with respect to \((\mu, \sigma^2)\) must be negative definite at \((\hat{\mu}_{MLE}, \hat{\sigma}^2_{MLE})\); direct computation confirms that it is (verification omitted for brevity).
1.6.5 Result Summary
The MLEs for the normal distribution are:
\[
\hat{\mu}_{MLE} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
Note: \(\hat{\sigma}^2_{MLE}\) is biased for finite \(n\). The unbiased estimator is \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\).
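A quick numerical check of these formulas, on a small assumed sample, also illustrates the bias note: the MLE divides by \(n\), the unbiased estimator by \(n-1\).

```python
# Assumed toy sample
data = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.3]
n = len(data)

mu_hat = sum(data) / n                                  # MLE of mu: sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n   # MLE of sigma^2 (divides by n)
s2 = sum((x - mu_hat) ** 2 for x in data) / (n - 1)     # unbiased estimator (divides by n-1)

print(mu_hat)       # about 4.9
print(sigma2_hat)   # smaller than s2 by the factor (n-1)/n
print(s2)
```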
1.7 Numerical Optimization
When the score equations cannot be solved analytically, numerical optimization is required. Several methods are available, each with different characteristics.
1.7.1 Newton-Raphson Method
The Newton-Raphson algorithm iteratively updates the parameter estimate:
\[
\theta^{(t+1)} = \theta^{(t)} - \frac{\ell'(\theta^{(t)})}{\ell''(\theta^{(t)})}
\]
For vector parameters:
\[
\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - H\!\left(\boldsymbol{\theta}^{(t)}\right)^{-1} \nabla \ell\!\left(\boldsymbol{\theta}^{(t)}\right)
\]
Advantages: Quadratic convergence (very fast near optimum), few iterations needed
Limitations: Requires Hessian computation (\(O(n^3)\) cost), sensitive to initialization
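As a sketch of the scalar update rule, here is a minimal Newton-Raphson iteration (assumed toy data) for the Cauchy location parameter, whose score equation has no closed-form solution. The score and its derivative follow from \(\ell(\theta) = -\sum_i \ln\!\left(1 + (x_i - \theta)^2\right) + \text{const}\).

```python
# Newton-Raphson for the Cauchy location parameter (assumed toy data).
data = [1.2, -0.3, 0.8, 2.5, 0.1, 1.7, 0.9]

def score(theta):
    # l'(theta) = sum_i 2 (x_i - theta) / (1 + (x_i - theta)^2)
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in data)

def score_prime(theta):
    # l''(theta) = sum_i 2 ((x_i - theta)^2 - 1) / (1 + (x_i - theta)^2)^2
    return sum(2 * ((x - theta) ** 2 - 1) / (1 + (x - theta) ** 2) ** 2 for x in data)

theta = sorted(data)[len(data) // 2]   # start at the median: a sensible initial value
for _ in range(20):
    step = score(theta) / score_prime(theta)
    theta -= step
    if abs(step) < 1e-10:
        break

print(theta)  # a stationary point of the log-likelihood near the median
```

The median start matters: the Cauchy log-likelihood can have multiple local maxima, illustrating the sensitivity to initialization noted above.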
1.7.2 Other Optimization Methods
Depending on the problem characteristics, other methods may be more suitable:
- Gradient Descent: Uses only first derivatives; linear convergence but robust and scalable to large problems
- BFGS: Quasi-Newton method approximating the Hessian using gradient information; superlinear convergence with \(O(n^2)\) cost
- Nelder-Mead: Derivative-free simplex method; works when gradients are unavailable or unreliable
- Expectation-Maximization: Specifically designed for models with latent variables; monotonically increases likelihood
See Numerical Methods for comprehensive coverage of these algorithms.
1.8 Regularity Conditions
The desirable properties of MLEs hold under certain regularity conditions:
- Identifiability: Different parameter values yield different distributions
- Common support: The support of \(f(x; \theta)\) does not depend on \(\theta\)
- Differentiability: The likelihood is twice continuously differentiable in \(\theta\)
- Fisher information: \(0 < I(\theta) < \infty\) for all \(\theta\)
- Uniform integrability: Exchange of derivatives and integrals is justified
These conditions are satisfied for most common parametric families but may fail in certain cases (e.g., uniform distribution with unknown endpoints).
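The uniform case can be sketched in a few lines (assumed toy data): for \(\mathrm{Uniform}(0, \theta)\), the likelihood is \(\theta^{-n}\) for \(\theta \geq \max_i x_i\) and \(0\) otherwise, so the MLE \(\hat{\theta} = \max_i x_i\) sits on the boundary of the support rather than at a zero of the score.

```python
# Non-regular example: Uniform(0, theta) with assumed toy data.
data = [0.9, 2.3, 1.4, 3.1, 0.5]
n = len(data)

def likelihood(theta):
    # L(theta) = theta^(-n) if theta >= max(x), else 0
    return theta ** (-n) if theta >= max(data) else 0.0

theta_hat = max(data)  # MLE: exactly on the support boundary

print(theta_hat)                      # 3.1
print(likelihood(theta_hat))          # positive
print(likelihood(theta_hat - 1e-9))   # 0.0: the likelihood is discontinuous here
```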
1.9 Comparison with Other Estimators
1.9.1 Method of Moments (MoM)
- MLE: Maximizes likelihood; asymptotically efficient
- MoM: Matches sample moments to population moments; simpler but less efficient
1.9.2 Bayesian Estimation
- MLE: Treats \(\theta\) as a fixed unknown; uses only the likelihood
- MAP (Maximum A Posteriori): Incorporates prior distribution \(\pi(\theta)\); maximizes posterior \(\pi(\theta|\mathbf{x}) \propto L(\theta;\mathbf{x})\pi(\theta)\)
With a uniform (non-informative) prior, MAP reduces to MLE.
1.9.3 Least Squares
For linear regression with normal errors, OLS and MLE coincide; in general, MLE applies to a much broader class of models.
1.10 Advantages and Limitations
1.10.1 Advantages
- Generality: Applicable to virtually any parametric model
- Optimality: Asymptotically efficient under regularity conditions
- Invariance: Transformations of parameters are handled automatically
- Inference: Standard framework for hypothesis tests and confidence intervals
1.10.2 Limitations
- Requires model specification: Sensitive to distributional assumptions
- Computational challenges: May require numerical optimization
- Small sample bias: Finite-sample properties can differ from asymptotic theory
- Boundary issues: MLEs may lie on the boundary of the parameter space
- Non-regularity: Properties may not hold if regularity conditions fail
1.11 Hypothesis Testing with MLEs
1.11.1 Likelihood Ratio Test
The likelihood ratio test statistic for testing \(H_0: \theta = \theta_0\) versus \(H_1: \theta \neq \theta_0\) is:
\[
\Lambda = 2\left[\ell(\hat{\theta}_{MLE}; \mathbf{x}) - \ell(\theta_0; \mathbf{x})\right]
\]
Under \(H_0\), \(\Lambda \xrightarrow{d} \chi^2_1\) as \(n \to \infty\).
1.11.2 Wald Test
The Wald test statistic is:
\[
W = \frac{\left(\hat{\theta}_{MLE} - \theta_0\right)^2}{\widehat{\operatorname{Var}}(\hat{\theta}_{MLE})}
\]
Under \(H_0\), \(W \xrightarrow{d} \chi^2_1\).
1.11.3 Score Test (Lagrange Multiplier Test)
The score test statistic is:
\[
S = \frac{U(\theta_0)^2}{n\, I(\theta_0)}, \qquad U(\theta_0) = \frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta}\bigg|_{\theta = \theta_0}
\]
Under \(H_0\), \(S \xrightarrow{d} \chi^2_1\); notably, the score test requires estimation only under the null.
All three tests are asymptotically equivalent but may differ in finite samples.
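A minimal sketch of the likelihood ratio test, for Bernoulli data with an assumed sample of 7 successes in 10 trials and \(H_0: p = 0.5\):

```python
import math

# Assumed data: 7 successes in 10 Bernoulli trials; null value p0 = 0.5
k, n = 7, 10
p_hat, p0 = k / n, 0.5

def loglik(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

lam = 2 * (loglik(p_hat) - loglik(p0))
print(lam)          # LRT statistic, about 1.646
print(lam > 3.841)  # False: below the chi^2_1 critical value at alpha = 0.05,
                    # so we fail to reject H0
```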
1.12 Practical Guidelines
- Check model assumptions: Ensure the distributional assumptions are reasonable for your data
- Examine the log-likelihood surface: Plot \(\ell(\theta)\) to check for multiple maxima or flat regions
- Use good starting values: For numerical optimization, choose sensible initial values (e.g., from method of moments)
- Verify convergence: Check that optimization algorithms have converged properly
- Assess sensitivity: Examine how estimates change with small perturbations to the data
- Report uncertainty: Always provide standard errors or confidence intervals
- Validate the model: Use residual diagnostics and goodness-of-fit tests
1. George Casella and Roger L. Berger. Statistical Inference. Duxbury Press, 2nd edition, 2002. URL: https://www.cengage.com/c/statistical-inference-2e-casella/9780534243128/
2. Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004. doi:10.1007/978-0-387-21736-9