1. Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE) [1][2] is a fundamental statistical method for estimating the parameters of a probability distribution or statistical model. Introduced by Ronald Fisher in the 1920s, MLE provides a principled and widely applicable framework for parameter estimation with desirable theoretical properties.
The core principle is elegant: choose parameter values that make the observed data most probable.
1.1 The Likelihood Function
1.1.1 Definition
Suppose we have observed data \(\mathbf{x} = (x_1, x_2, \ldots, x_n)\) that are assumed to be realizations from a probability distribution with probability density function (or mass function) \(f(x; \theta)\), where \(\theta \in \Theta\) is an unknown parameter (or parameter vector).
The likelihood function is defined as:
\begin{equation}
L(\theta; \mathbf{x}) = \prod_{i=1}^{n} f(x_i; \theta),
\label{eq:likelihood_function}
\end{equation}
assuming the observations are independent and identically distributed (i.i.d.).
Understanding the Likelihood Function
Key conceptual shift: While \(f(x; \theta)\) is a probability density as a function of \(x\) for fixed \(\theta\), the likelihood \(L(\theta; \mathbf{x})\) is a function of \(\theta\) for fixed observed data \(\mathbf{x}\).
- Probability perspective: \(f(x; \theta)\) tells us how probable different data values are for a given parameter
- Likelihood perspective: \(L(\theta; \mathbf{x})\) tells us how plausible different parameter values are given the observed data
Important: The likelihood is not a probability distribution over \(\theta\). It does not integrate to 1, and \(\theta\) is treated as an unknown constant, not a random variable.
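To make the two perspectives concrete, here is a minimal sketch in plain Python, using a small assumed Bernoulli data set (7 successes in 10 trials): the data stay fixed while the parameter \(p\) varies.

```python
# Likelihood as a function of the parameter, for fixed observed data.
# The data set below is an assumed toy example, not from the text.
data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

def likelihood(p, xs):
    """L(p; x) = prod_i p^{x_i} (1 - p)^{1 - x_i} for i.i.d. Bernoulli data."""
    result = 1.0
    for x in xs:
        result *= p if x == 1 else (1.0 - p)
    return result

# Same data, different candidate parameter values:
for p in [0.3, 0.5, 0.7, 0.9]:
    print(f"L({p}; x) = {likelihood(p, data):.6f}")
```

Note that the printed values do not sum or integrate to 1 over \(p\), echoing the point above: the likelihood is not a probability distribution over the parameter. The largest value occurs at \(p = 0.7\), the sample proportion.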
1.1.2 Properties of the Likelihood Function
Non-negativity: \(L(\theta; \mathbf{x}) \geq 0\) for all \(\theta \in \Theta\), since it is a product of densities (or probabilities).
Scale invariance: The likelihood can be multiplied by any positive constant without changing the location of its maximum. This motivates working with the log-likelihood.
1.2 The Log-Likelihood Function
1.2.1 Definition
The log-likelihood function is the natural logarithm of the likelihood function:
\begin{equation}
\ell(\theta; \mathbf{x}) = \ln L(\theta; \mathbf{x}) = \sum_{i=1}^{n} \ln f(x_i; \theta)
\label{eq:log_likelihood}
\end{equation}
1.2.2 Why Use the Log-Likelihood?
There are several compelling reasons to work with \(\ell(\theta)\) instead of \(L(\theta)\):
- Computational stability: Products of many small probabilities can cause numerical underflow; sums of logarithms are far more stable
- Mathematical convenience: The product in \(\ref{eq:likelihood_function}\) becomes a sum in \(\ref{eq:log_likelihood}\), simplifying differentiation
- Preservation of maxima: Since \(\ln(\cdot)\) is a strictly increasing function,
  \[
  \arg\max_{\theta} L(\theta; \mathbf{x}) = \arg\max_{\theta} \ell(\theta; \mathbf{x})
  \]
- Asymptotic theory: Many theoretical results are more naturally expressed in terms of the log-likelihood
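The computational-stability point can be demonstrated in a few lines. In this sketch, 0.4 is an assumed stand-in for each density value \(f(x_i; \theta)\); a product of 2000 such values underflows to exactly zero in double precision, while the equivalent sum of logs is well behaved.

```python
import math

# 2000 identical density values, each below 1 (assumed stand-ins for f(x_i; theta))
densities = [0.4] * 2000

# Naive product: underflows to exactly 0.0 in IEEE double precision
product = 1.0
for d in densities:
    product *= d
print(product)  # 0.0: the likelihood has underflowed

# Log-likelihood: a sum of logs, numerically stable
log_likelihood = sum(math.log(d) for d in densities)
print(log_likelihood)  # about -1832.58, no numerical trouble
```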
1.3 The Maximum Likelihood Estimator
1.3.1 Definition
The maximum likelihood estimator (MLE) \(\hat{\theta}_{MLE}\) is the value of \(\theta\) that maximizes the likelihood (or equivalently, the log-likelihood):
\[
\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta} L(\theta; \mathbf{x}) = \arg\max_{\theta \in \Theta} \ell(\theta; \mathbf{x})
\]
1.3.2 Finding the MLE
For differentiable likelihoods, the MLE can often be found by solving the score equations:
\[
\frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta} = 0
\]
The derivative \(\frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta}\) is called the score function.
Derivation of First-Order Condition
For a scalar parameter \(\theta\), the first-order necessary condition for an interior maximum is:
\[
\frac{\partial L(\theta; \mathbf{x})}{\partial \theta} = 0
\]
Using the chain rule:
\[
\frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta} = \frac{\partial}{\partial \theta} \ln L(\theta; \mathbf{x}) = \frac{1}{L(\theta; \mathbf{x})} \frac{\partial L(\theta; \mathbf{x})}{\partial \theta}
\]
Therefore, since \(L(\theta; \mathbf{x}) > 0\), the score equation becomes:
\[
\frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta} = \sum_{i=1}^{n} \frac{\partial \ln f(x_i; \theta)}{\partial \theta} = 0
\]
For vector parameters \(\boldsymbol{\theta} = (\theta_1, \ldots, \theta_k)^\top\), we have a system of \(k\) equations:
\[
\frac{\partial \ell(\boldsymbol{\theta}; \mathbf{x})}{\partial \theta_j} = 0, \qquad j = 1, \ldots, k,
\]
or in vector notation:
\[
\nabla_{\boldsymbol{\theta}}\, \ell(\boldsymbol{\theta}; \mathbf{x}) = \mathbf{0}
\]
1.3.3 Second-Order Condition
To verify that a critical point is indeed a maximum, check the second-order condition:
For scalar \(\theta\):
\[
\frac{\partial^2 \ell(\theta; \mathbf{x})}{\partial \theta^2}\bigg|_{\theta = \hat{\theta}} < 0
\]
For vector \(\boldsymbol{\theta}\), the Hessian matrix must be negative definite:
\[
H(\hat{\boldsymbol{\theta}}) = \frac{\partial^2 \ell(\boldsymbol{\theta}; \mathbf{x})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}^\top}\bigg|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} \prec 0
\]
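Both conditions can be checked numerically. The sketch below uses an assumed exponential toy sample (rate parameter `lam`, log-likelihood \(\ell(\lambda) = n\ln\lambda - \lambda\sum x_i\)) and central finite differences to confirm that the score vanishes and the curvature is negative at the well-known closed-form MLE \(\hat{\lambda} = n/\sum x_i\):

```python
import math

# Assumed exponential toy data
data = [0.8, 1.3, 0.4, 2.1, 0.9, 1.7]

def loglik(lam):
    # l(lam) = n * ln(lam) - lam * sum(x_i)
    return len(data) * math.log(lam) - lam * sum(data)

# Closed-form MLE from the score equation: lam_hat = n / sum(x_i)
lam_hat = len(data) / sum(data)

# Central finite differences for l'(lam) and l''(lam)
h = 1e-5
score = (loglik(lam_hat + h) - loglik(lam_hat - h)) / (2 * h)
curvature = (loglik(lam_hat + h) - 2 * loglik(lam_hat) + loglik(lam_hat - h)) / h ** 2

print(score)      # approximately 0: first-order condition holds
print(curvature)  # negative: second-order condition holds
```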
1.4 Fundamental Properties of MLEs
1.4.1 1. Consistency
Under regularity conditions, the MLE is consistent: as the sample size \(n \to \infty\), the MLE converges in probability to the true parameter value:
\[
\hat{\theta}_{MLE} \xrightarrow{p} \theta_0,
\]
where \(\theta_0\) is the true parameter value.
1.4.2 2. Asymptotic Normality
Under regularity conditions, the MLE has an asymptotically normal distribution:
\[
\sqrt{n}\left(\hat{\theta}_{MLE} - \theta_0\right) \xrightarrow{d} \mathcal{N}\!\left(0,\, I(\theta_0)^{-1}\right),
\]
where \(I(\theta_0)\) is the Fisher information:
\[
I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial \ln f(X; \theta)}{\partial \theta}\right)^{\!2}\right]
\]
Alternative Expression for Fisher Information
The Fisher information can also be expressed as:
\[
I(\theta) = -\mathbb{E}\!\left[\frac{\partial^2 \ln f(X; \theta)}{\partial \theta^2}\right]
\]
Proof of equivalence (for scalar \(\theta\)):
Starting with the fact that \(\int f(x; \theta)\,dx = 1\) for all \(\theta\), differentiate both sides with respect to \(\theta\):
\[
\int \frac{\partial f(x; \theta)}{\partial \theta}\,dx = 0
\]
Multiply and divide by \(f(x; \theta)\):
\[
\int \frac{\partial \ln f(x; \theta)}{\partial \theta}\, f(x; \theta)\,dx = 0
\]
Differentiating once more with respect to \(\theta\):
\[
\int \frac{\partial}{\partial \theta}\!\left[\frac{\partial \ln f(x; \theta)}{\partial \theta}\, f(x; \theta)\right] dx = 0
\]
Applying the product rule (and using \(\frac{\partial f}{\partial \theta} = \frac{\partial \ln f}{\partial \theta}\, f\) in the second term):
\[
\int \frac{\partial^2 \ln f(x; \theta)}{\partial \theta^2}\, f(x; \theta)\,dx + \int \left(\frac{\partial \ln f(x; \theta)}{\partial \theta}\right)^{\!2} f(x; \theta)\,dx = 0
\]
Taking expectations:
\[
\mathbb{E}\!\left[\frac{\partial^2 \ln f(X; \theta)}{\partial \theta^2}\right] + \mathbb{E}\!\left[\left(\frac{\partial \ln f(X; \theta)}{\partial \theta}\right)^{\!2}\right] = 0
\]
Rearranging:
\[
I(\theta) = \mathbb{E}\!\left[\left(\frac{\partial \ln f(X; \theta)}{\partial \theta}\right)^{\!2}\right] = -\mathbb{E}\!\left[\frac{\partial^2 \ln f(X; \theta)}{\partial \theta^2}\right]
\]
1.4.3 3. Efficiency (Cramér-Rao Lower Bound)
Under regularity conditions, the MLE achieves the Cramér-Rao lower bound asymptotically, meaning it has the smallest possible asymptotic variance among consistent estimators:
\[
\operatorname{Var}(\hat{\theta}_{MLE}) \approx \frac{1}{n\, I(\theta_0)} \quad \text{for large } n
\]
This makes the MLE asymptotically efficient.
1.4.4 4. Invariance Property
If \(\hat{\theta}_{MLE}\) is the MLE of \(\theta\), and \(g(\cdot)\) is any function, then the MLE of \(\tau = g(\theta)\) is:
\[
\hat{\tau}_{MLE} = g\!\left(\hat{\theta}_{MLE}\right)
\]
This property is extremely useful in practice, as it allows us to obtain MLEs of transformed parameters directly.
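As a sketch of the invariance property, consider Bernoulli data and the odds \(g(p) = p/(1-p)\) (assumed toy data, 7 successes in 10 trials). Maximizing the likelihood directly in the odds parameterization over a fine grid recovers the same answer as plugging \(\hat{p}\) into \(g\):

```python
import math

# Assumed toy data: 7 successes in 10 Bernoulli trials
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
k, n = sum(data), len(data)

p_hat = k / n                           # MLE of p
odds_via_invariance = p_hat / (1 - p_hat)

def loglik_odds(w):
    # Reparameterize: p = w / (1 + w), so l(w) = k*ln(p) + (n-k)*ln(1-p)
    p = w / (1 + w)
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Maximize the odds parameterization directly on a fine grid
grid = [i / 1000 for i in range(1, 10000)]   # w in (0.001, 9.999)
odds_direct = max(grid, key=loglik_odds)

print(odds_via_invariance)  # 7/3, about 2.3333
print(odds_direct)          # same maximizer, up to grid resolution
```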
1.5 Standard Errors and Confidence Intervals
1.5.1 Observed Fisher Information
In practice, we estimate the variance of the MLE using the observed Fisher information:
\[
J(\hat{\theta}) = -\frac{\partial^2 \ell(\theta; \mathbf{x})}{\partial \theta^2}\bigg|_{\theta = \hat{\theta}}
\]
1.5.2 Standard Error
The standard error of the MLE is:
\[
\widehat{SE}(\hat{\theta}_{MLE}) = \frac{1}{\sqrt{J(\hat{\theta}_{MLE})}}
\]
For vector parameters:
\[
\widehat{SE}(\hat{\theta}_j) = \sqrt{\left[J(\hat{\boldsymbol{\theta}}_{MLE})^{-1}\right]_{jj}}
\]
1.5.3 Asymptotic Confidence Intervals
An approximate \((1-\alpha)\) confidence interval for \(\theta\) is:
\[
\hat{\theta}_{MLE} \pm z_{\alpha/2}\, \widehat{SE}(\hat{\theta}_{MLE}),
\]
where \(z_{\alpha/2}\) is the \((1-\alpha/2)\) quantile of the standard normal distribution.
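The recipe above can be sketched for a Bernoulli parameter with assumed data (7 successes in 10 trials), where the observed information works out to \(J(\hat{p}) = n / (\hat{p}(1-\hat{p}))\) and hence \(\widehat{SE} = \sqrt{\hat{p}(1-\hat{p})/n}\):

```python
import math

# Assumed data: 7 successes in 10 Bernoulli trials
k, n = 7, 10
p_hat = k / n                               # MLE of p

se = math.sqrt(p_hat * (1 - p_hat) / n)     # 1 / sqrt(J(p_hat))
z = 1.96                                    # z_{0.025}: 97.5% standard normal quantile

lo, hi = p_hat - z * se, p_hat + z * se
print(round(lo, 3), round(hi, 3))           # approximate 95% CI: 0.416 0.984
```

The wide interval reflects the small sample size; Wald intervals like this rely on the asymptotic normality of the MLE and can be inaccurate for small \(n\).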
1.6 Worked Example: Normal Distribution
Let's derive the MLE for the mean \(\mu\) and variance \(\sigma^2\) of a normal distribution, given i.i.d. observations \(x_1, \ldots, x_n \sim \mathcal{N}(\mu, \sigma^2)\).
1.6.1 Step 1: Write the Likelihood
The probability density function is:
\[
f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\]
The likelihood function is:
\[
L(\mu, \sigma^2; \mathbf{x}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right)
\]
1.6.2 Step 2: Take the Log-Likelihood
\[
\ell(\mu, \sigma^2; \mathbf{x}) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2
\]
1.6.3 Step 3: Find the Score Equations
For \(\mu\):
\begin{equation}
\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)
\label{eq:normal_score_mu}
\end{equation}
Setting \(\ref{eq:normal_score_mu}\) equal to zero:
\[
\sum_{i=1}^{n} (x_i - \hat{\mu}) = 0 \quad \Longrightarrow \quad \hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}
\]
For \(\sigma^2\):
\begin{equation}
\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2
\label{eq:normal_score_sigma2}
\end{equation}
Setting \(\ref{eq:normal_score_sigma2}\) equal to zero and substituting \(\hat{\mu}_{MLE}\):
\[
-\frac{n}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4} \sum_{i=1}^{n} (x_i - \bar{x})^2 = 0
\]
Solving for \(\hat{\sigma}^2\):
\[
\hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
1.6.4 Step 4: Verify Second-Order Conditions
The Hessian of \(\ell\) with respect to \((\mu, \sigma^2)\) must be negative definite at \((\hat{\mu}_{MLE}, \hat{\sigma}^2_{MLE})\); direct computation confirms that it is (verification omitted for brevity).
1.6.5 Result Summary
The MLEs for the normal distribution are:
\[
\hat{\mu}_{MLE} = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2_{MLE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
Note: \(\hat{\sigma}^2_{MLE}\) is biased for finite \(n\). The unbiased estimator is \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2\).
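A quick numerical check of these formulas, on a small assumed sample, also illustrates the bias note: the MLE divides by \(n\), the unbiased estimator by \(n-1\).

```python
# Assumed toy sample
data = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.3]
n = len(data)

mu_hat = sum(data) / n                                  # MLE of mu: sample mean
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n   # MLE of sigma^2 (divides by n)
s2 = sum((x - mu_hat) ** 2 for x in data) / (n - 1)     # unbiased estimator (divides by n-1)

print(mu_hat)       # about 4.9
print(sigma2_hat)   # smaller than s2 by the factor (n-1)/n
print(s2)
```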
1.7 Numerical Optimization
When the score equations cannot be solved analytically, numerical optimization is required. Several methods are available, each with different characteristics.
1.7.1 Newton-Raphson Method
The Newton-Raphson algorithm iteratively updates the parameter estimate:
\[
\theta^{(t+1)} = \theta^{(t)} - \frac{\ell'(\theta^{(t)})}{\ell''(\theta^{(t)})}
\]
For vector parameters:
\[
\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - H\!\left(\boldsymbol{\theta}^{(t)}\right)^{-1} \nabla \ell\!\left(\boldsymbol{\theta}^{(t)}\right)
\]
Advantages: Quadratic convergence (very fast near optimum), few iterations needed
Limitations: Requires Hessian computation (\(O(n^3)\) cost), sensitive to initialization
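As a sketch of the scalar update rule, here is a minimal Newton-Raphson iteration (assumed toy data) for the Cauchy location parameter, whose score equation has no closed-form solution. The score and its derivative follow from \(\ell(\theta) = -\sum_i \ln\!\left(1 + (x_i - \theta)^2\right) + \text{const}\).

```python
# Newton-Raphson for the Cauchy location parameter (assumed toy data).
data = [1.2, -0.3, 0.8, 2.5, 0.1, 1.7, 0.9]

def score(theta):
    # l'(theta) = sum_i 2 (x_i - theta) / (1 + (x_i - theta)^2)
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in data)

def score_prime(theta):
    # l''(theta) = sum_i 2 ((x_i - theta)^2 - 1) / (1 + (x_i - theta)^2)^2
    return sum(2 * ((x - theta) ** 2 - 1) / (1 + (x - theta) ** 2) ** 2 for x in data)

theta = sorted(data)[len(data) // 2]   # start at the median: a sensible initial value
for _ in range(20):
    step = score(theta) / score_prime(theta)
    theta -= step
    if abs(step) < 1e-10:
        break

print(theta)  # a stationary point of the log-likelihood near the median
```

The median start matters: the Cauchy log-likelihood can have multiple local maxima, illustrating the sensitivity to initialization noted above.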
1.7.2 Other Optimization Methods
Depending on the problem characteristics, other methods may be more suitable:
- Gradient Descent: Uses only first derivatives; linear convergence but robust and scalable to large problems
- BFGS: Quasi-Newton method approximating the Hessian using gradient information; superlinear convergence with \(O(n^2)\) cost
- Nelder-Mead: Derivative-free simplex method; works when gradients are unavailable or unreliable
- Expectation-Maximization: Specifically designed for models with latent variables; monotonically increases likelihood
See Numerical Methods for comprehensive coverage of these algorithms.
1.8 Regularity Conditions
The desirable properties of MLEs hold under certain regularity conditions:
- Identifiability: Different parameter values yield different distributions
- Common support: The support of \(f(x; \theta)\) does not depend on \(\theta\)
- Differentiability: The likelihood is twice continuously differentiable in \(\theta\)
- Fisher information: \(0 < I(\theta) < \infty\) for all \(\theta\)
- Uniform integrability: Exchange of derivatives and integrals is justified
These conditions are satisfied for most common parametric families but may fail in certain cases (e.g., uniform distribution with unknown endpoints).
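The uniform case can be sketched in a few lines (assumed toy data): for \(\mathrm{Uniform}(0, \theta)\), the likelihood is \(\theta^{-n}\) for \(\theta \geq \max_i x_i\) and \(0\) otherwise, so the MLE \(\hat{\theta} = \max_i x_i\) sits on the boundary of the support rather than at a zero of the score.

```python
# Non-regular example: Uniform(0, theta) with assumed toy data.
data = [0.9, 2.3, 1.4, 3.1, 0.5]
n = len(data)

def likelihood(theta):
    # L(theta) = theta^(-n) if theta >= max(x), else 0
    return theta ** (-n) if theta >= max(data) else 0.0

theta_hat = max(data)  # MLE: exactly on the support boundary

print(theta_hat)                      # 3.1
print(likelihood(theta_hat))          # positive
print(likelihood(theta_hat - 1e-9))   # 0.0: the likelihood is discontinuous here
```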
1.9 Comparison with Other Estimators
1.9.1 Method of Moments (MoM)
- MLE: Maximizes likelihood; asymptotically efficient
- MoM: Matches sample moments to population moments; simpler but less efficient
1.9.2 Bayesian Estimation
- MLE: Treats \(\theta\) as a fixed unknown; uses only the likelihood
- MAP (Maximum A Posteriori): Incorporates prior distribution \(\pi(\theta)\); maximizes posterior \(\pi(\theta|\mathbf{x}) \propto L(\theta;\mathbf{x})\pi(\theta)\)
With a uniform (non-informative) prior, MAP reduces to MLE.
1.9.3 Least Squares
For linear regression with normal errors, OLS and MLE coincide; in general, MLE applies to a much broader class of models.
1.10 Advantages and Limitations
1.10.1 Advantages
- Generality: Applicable to virtually any parametric model
- Optimality: Asymptotically efficient under regularity conditions
- Invariance: Transformations of parameters are handled automatically
- Inference: Standard framework for hypothesis tests and confidence intervals
1.10.2 Limitations
- Requires model specification: Sensitive to distributional assumptions
- Computational challenges: May require numerical optimization
- Small sample bias: Finite-sample properties can differ from asymptotic theory
- Boundary issues: MLEs may lie on the boundary of the parameter space
- Non-regularity: Properties may not hold if regularity conditions fail
1.11 Hypothesis Testing with MLEs
1.11.1 Likelihood Ratio Test
The likelihood ratio test statistic for testing \(H_0: \theta = \theta_0\) versus \(H_1: \theta \neq \theta_0\) is:
\[
\Lambda = 2\left[\ell(\hat{\theta}_{MLE}; \mathbf{x}) - \ell(\theta_0; \mathbf{x})\right]
\]
Under \(H_0\), \(\Lambda \xrightarrow{d} \chi^2_1\) as \(n \to \infty\).
1.11.2 Wald Test
The Wald test statistic is:
\[
W = \frac{\left(\hat{\theta}_{MLE} - \theta_0\right)^2}{\widehat{\operatorname{Var}}(\hat{\theta}_{MLE})}
\]
Under \(H_0\), \(W \xrightarrow{d} \chi^2_1\).
1.11.3 Score Test (Lagrange Multiplier Test)
The score test statistic is:
\[
S = \frac{U(\theta_0)^2}{n\, I(\theta_0)}, \qquad U(\theta_0) = \frac{\partial \ell(\theta; \mathbf{x})}{\partial \theta}\bigg|_{\theta = \theta_0}
\]
Under \(H_0\), \(S \xrightarrow{d} \chi^2_1\); notably, the score test requires estimation only under the null.
All three tests are asymptotically equivalent but may differ in finite samples.
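A minimal sketch of the likelihood ratio test, for Bernoulli data with an assumed sample of 7 successes in 10 trials and \(H_0: p = 0.5\):

```python
import math

# Assumed data: 7 successes in 10 Bernoulli trials; null value p0 = 0.5
k, n = 7, 10
p_hat, p0 = k / n, 0.5

def loglik(p):
    return k * math.log(p) + (n - k) * math.log(1 - p)

lam = 2 * (loglik(p_hat) - loglik(p0))
print(lam)          # LRT statistic, about 1.646
print(lam > 3.841)  # False: below the chi^2_1 critical value at alpha = 0.05,
                    # so we fail to reject H0
```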
1.12 Practical Guidelines
- Check model assumptions: Ensure the distributional assumptions are reasonable for your data
- Examine the log-likelihood surface: Plot \(\ell(\theta)\) to check for multiple maxima or flat regions
- Use good starting values: For numerical optimization, choose sensible initial values (e.g., from method of moments)
- Verify convergence: Check that optimization algorithms have converged properly
- Assess sensitivity: Examine how estimates change with small perturbations to the data
- Report uncertainty: Always provide standard errors or confidence intervals
- Validate the model: Use residual diagnostics and goodness-of-fit tests
1. George Casella and Roger L. Berger. Statistical Inference. Duxbury Press, 2nd edition, 2002. URL: https://www.cengage.com/c/statistical-inference-2e-casella/9780534243128/
2. Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004. doi:10.1007/978-0-387-21736-9