 # 4.4  Maximum Likelihood Estimators

Estimators can be constructed in various ways, and there is some controversy as to which is most suitable in any given situation. There is considerable literature on the use of unbiased estimators, but biased estimators are sometimes more appropriate. Consider two estimators for variance:

[4.27]  $\displaystyle \frac{1}{m-1}\sum_{k=1}^{m}\left(X^{[k]}-\bar{X}\right)^{2}$

[4.28]  $\displaystyle \frac{1}{m+1}\sum_{k=1}^{m}\left(X^{[k]}-\bar{X}\right)^{2}$

where $\bar{X}$ denotes the sample mean.

The first is widely used because it is unbiased. However, if X is known to be normal, the second has a lower MSE than either the first or the sample estimator [4.5]. Sometimes unbiased estimators have disturbing properties or are downright nonsensical. Indeed, there are circumstances in which unbiased estimators do not exist. As an alternative, it might seem appropriate to seek estimators with minimal MSE, but there is no systematic way to identify such estimators.
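To make the comparison concrete, here is a small Monte Carlo sketch (assuming NumPy; the variable names are ours) that estimates the MSE of the divisors m − 1 (the unbiased estimator), m (the sample estimator [4.5]), and m + 1 for normal data:

```python
import numpy as np

# Monte Carlo comparison of three variance estimators for normal data.
# Divisors: m-1 (unbiased), m (sample estimator), m+1 (lower MSE under normality).
rng = np.random.default_rng(0)
m, trials, true_var = 10, 200_000, 1.0
x = rng.normal(0.0, np.sqrt(true_var), size=(trials, m))

# Sum of squared deviations about the sample mean, one value per trial.
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

# Estimated MSE for each divisor m+d, d in {-1, 0, +1}.
mse = {d: np.mean((ss / (m + d) - true_var) ** 2) for d in (-1, 0, 1)}
print(mse)
```

Under normality the theoretical MSEs are 2σ⁴/(m − 1), (2m − 1)σ⁴/m², and 2σ⁴/(m + 1), so the m + 1 divisor achieves the lowest MSE of the three, as the simulation reflects.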

###### 4.4.1 ML estimators

Maximum likelihood (ML) is an approach to constructing estimators that is widely applicable. The resulting ML estimators are not always optimal in terms of bias or MSE, but they tend to be good estimators nonetheless. ML estimators are attractive because they exist and can be easily identified in most situations. Unlike sample estimators, which make no use of an assumed underlying distribution, ML estimators fully utilize such information.

Consider a sample {X^[1], X^[2], …, X^[m]} whose underlying distribution is known except for some parameter θ. To emphasize this θ dependence, we denote the PDF of X as ϕ(x | θ). Because the random vectors that comprise a sample are independent, the PDF for the entire sample is

[4.29]  $\displaystyle \phi_m\!\left(x^{[1]}, x^{[2]}, \ldots, x^{[m]} \,\middle|\, \theta\right) = \prod_{k=1}^{m} \phi\!\left(x^{[k]} \,\middle|\, \theta\right)$

For any realization {x^[1], x^[2], …, x^[m]}, the sample PDF ϕ_m is then a function of θ. We call this the likelihood function and denote it L(θ | x^[1], x^[2], …, x^[m]) or simply L(θ). Mathematically, L(θ | x^[1], x^[2], …, x^[m]) is identical to ϕ_m(x^[1], x^[2], …, x^[m] | θ). The new name and notation merely indicate a different perspective. We think of ϕ_m as a function of a realization dependent upon a parameter. We think of L as a function of a parameter dependent upon a realization. We define the ML estimate of θ as the value h that maximizes the likelihood function. It is the value of θ that associates the maximum probability density with the data set {x^[1], x^[2], …, x^[m]}.
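This change of perspective can be sketched in code. The example below (our own Bernoulli coin-flip model, not one from the text; assumes NumPy) fixes a realization and treats the sample PDF as a function of θ, locating the maximizer on a grid:

```python
import numpy as np

# Likelihood of a Bernoulli parameter theta given a fixed realization.
# Illustrative model: each observation is 1 (heads) with probability theta.
data = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])  # 7 heads in m = 10 flips

def likelihood(theta):
    # Identical to the sample PDF, but viewed as a function of theta.
    return np.prod(theta ** data * (1 - theta) ** (1 - data))

grid = np.linspace(0.001, 0.999, 999)
L = np.array([likelihood(t) for t in grid])
theta_ml = grid[L.argmax()]
print(theta_ml)  # 0.7, the sample proportion of heads
```

The grid maximizer coincides with the sample proportion of heads, which is the analytic ML estimate for this model.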

###### 4.4.2 ML estimates of scalar parameters

Given a differentiable likelihood function L(θ) of a scalar parameter θ, solving for an ML estimate is a simple application of calculus. Standard techniques for maximizing differentiable functions apply. We take the derivative of the likelihood function and set it equal to 0:

[4.30]  $\displaystyle \frac{dL(\theta)}{d\theta} = 0$

Roots of this equation are investigated to determine which maximizes L(θ). If θ is restricted to some region Ω, values on the boundary of Ω must also be investigated. For example, if θ represents a variance, then Ω = [0,∞), and it is conceivable that the likelihood function is maximized at θ = 0.

It is often convenient to maximize the logarithm of the likelihood function, which is called the log-likelihood function. To see why, compare

[4.31]  $\displaystyle \frac{d}{d\theta}\,L(\theta) = \frac{d}{d\theta} \prod_{k=1}^{m} \phi\!\left(x^{[k]} \,\middle|\, \theta\right)$

with

[4.32]  $\displaystyle \frac{d}{d\theta}\,\log L(\theta) = \sum_{k=1}^{m} \frac{d}{d\theta}\,\log \phi\!\left(x^{[k]} \,\middle|\, \theta\right)$

Because the latter is a sum, its derivative is easier to work with. The logarithm function is strictly increasing, so any value h that maximizes L will also maximize log L. In some cases, an analytic solution for the roots of the equation

[4.33]  $\displaystyle \frac{d \log L(\theta)}{d\theta} = 0$

can be found. Otherwise, numerical techniques such as Newton's method must be applied.
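As a sketch of the numerical route, the following applies Newton's method to equation [4.33] for an exponential model ϕ(x | λ) = λe^(−λx) (our illustrative choice; assumes NumPy). Here d log L/dλ = m/λ − Σ x^[k], so the analytic root 1/x̄ is available to check the iteration against:

```python
import numpy as np

# Newton's method on the log-likelihood equation for an exponential model.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)  # true lambda = 0.5
m, S = x.size, x.sum()

f = lambda lam: m / lam - S          # d(log L)/d(lambda)
fprime = lambda lam: -m / lam ** 2   # second derivative of log L

lam = 0.1  # small positive starting guess, well inside the basin of convergence
for _ in range(25):
    lam -= f(lam) / fprime(lam)      # Newton update

print(lam, 1 / x.mean())             # Newton root vs analytic ML estimate
```

For this model the iteration converges quadratically to m/S = 1/x̄, so the numerical and analytic answers agree to machine precision.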

It is possible that the log-likelihood function fails to achieve a maximum on Ω, or that it achieves a maximum at multiple points. In either circumstance, review your assumptions, either to determine what is preventing the log-likelihood function from achieving a maximum or to identify criteria for selecting one of the maxima as your estimate.

###### 4.4.3 ML estimates of non-scalar parameters

The foregoing technique generalizes to vector-valued parameters θ, with gradients replacing derivatives. We may attempt to maximize the likelihood function directly, solving

[4.34]  $\nabla L(\theta) = 0$

or work with the log-likelihood function, solving

[4.35]  $\nabla \log L(\theta) = 0$

Again, this can sometimes be solved analytically, but numerical solutions are often necessary. Similar issues of existence and uniqueness arise.
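For a concrete case where the analytic route works, the normal model with θ = (μ, σ²) solves [4.35] in closed form: μ̂ = x̄ and σ̂² = (1/m) Σ (x^[k] − x̄)². The sketch below (the model choice is ours; assumes NumPy) checks that the gradient of the log-likelihood vanishes at these values:

```python
import numpy as np

# ML estimation of the vector parameter theta = (mu, sigma^2) for normal data.
rng = np.random.default_rng(2)
x = rng.normal(3.0, 2.0, size=1000)
m = x.size

mu_hat = x.mean()
sigma2_hat = ((x - mu_hat) ** 2).mean()  # note divisor m, not m - 1

def grad_loglik(mu, s2):
    # Gradient of log L(mu, sigma^2) for the normal model.
    g_mu = (x - mu).sum() / s2
    g_s2 = -m / (2 * s2) + ((x - mu) ** 2).sum() / (2 * s2 ** 2)
    return np.array([g_mu, g_s2])

print(grad_loglik(mu_hat, sigma2_hat))  # approximately zero at the ML estimate
```

Note that the ML variance estimate divides by m, so it is biased; this is one sense in which ML estimators are not always optimal in terms of bias.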

A matrix-valued parameter θ may be estimated similarly. Simply arrange the components of the matrix into a vector and proceed accordingly.
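A minimal sketch of the flattening idea (the quadratic objective is a stand-in for a log-likelihood, not a real one; assumes NumPy): wrap a matrix-valued objective so that vector-based maximization, here plain gradient ascent, can be applied via `ravel` and `reshape`:

```python
import numpy as np

def matrix_objective(A):
    # Hypothetical objective in a 2x2 matrix parameter A,
    # maximized at the identity matrix.
    return -np.sum((A - np.eye(2)) ** 2)

def vector_objective(v):
    return matrix_objective(v.reshape(2, 2))  # restore the matrix shape

# Gradient ascent over the flattened 4-vector of matrix components.
v = np.zeros(4)
for _ in range(100):
    A = v.reshape(2, 2)
    grad = -2.0 * (A - np.eye(2))  # gradient of matrix_objective w.r.t. A
    v = v + 0.1 * grad.ravel()     # ascent step in the flattened parameter

print(v.reshape(2, 2))             # converges to the identity matrix
print(vector_objective(v))         # objective near its maximum of 0
```

Any vector-based optimizer can be substituted for the hand-written ascent loop; the point is only the round trip between matrix and vector forms.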