Likelihood

Introduction

The likelihood method estimates parameters by combining an assumed distribution for the data with a dataset sampled from that distribution. The parameters can be estimated either with standard methods of calculus or with a computer. This page covers the calculus behind the likelihood method.

The goal is to find the most likely value of the parameter(s) given a set of random variables, $X_1, X_2, \ldots, X_N$. When the solution to the likelihood method is found using random variables, we call the solution the maximum likelihood estimator. Once you observe some values, and the data assume the role of the random variables, then you can plug in the data and calculate an estimate of the parameter(s). The actual value, using a specific dataset, is called the maximum likelihood estimate.

The logic underlying the likelihood method goes like this. Set up the likelihood function. The maximum likelihood estimator is the argument that maximizes the likelihood function. Often, this is written as

$$\hat{\theta} = \operatorname{argmax}_{\theta} L(\theta \mid \mathbf{X})$$

to denote that the best guess is the maximizing argument of the likelihood function given the data $\mathbf{X} = (X_1, X_2, \ldots, X_N)$. The calculus is then left to the practitioner, where either pen and paper or a computer will do. These notes aim to provide a short introduction to the intuition behind the setup of the likelihood function and to show the most common analytical strategy for finding maximum likelihood estimates.

Intuition (Bernoulli)

The likelihood function is defined relative to the density function $f(x \mid \theta)$ of the distribution that is assumed to have generated the data. The likelihood is defined to be the product of the density function evaluated at each datum in the dataset. We think of the likelihood function as a function of the parameter(s), generalized as $\theta$, given the random variables $\mathbf{X}$:

$$L(\theta \mid \mathbf{X}) = \prod_{n=1}^N f(X_n \mid \theta)$$

The intuition behind the product of density functions goes like this. Imagine you have four random variables representing flips of a fair coin. Say the outcomes are $\text{H, H, T, H}$. Assuming the random variables are independent, the probability associated with this event is

$$\frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2}$$

Now, imagine that you don't know that the coin is fair; instead, all you know is that the probability of heads is some number $p$. The probability above is then rewritten as

$$p \cdot p \cdot (1 - p) \cdot p$$

Next, since we know that the Bernoulli distribution is an appropriate model of coin flips, write this probability using the density function of a Bernoulli distribution. Since the Bernoulli distribution codes heads as $1$ and tails as $0$, we have

$$f(1 \mid p) \cdot f(1 \mid p) \cdot f(0 \mid p) \cdot f(1 \mid p)$$
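For reference, a standard way to write the Bernoulli density with heads coded as $1$ and tails coded as $0$ is

$$f(x \mid p) = p^x (1 - p)^{1 - x}, \qquad x \in \{0, 1\},$$

so that $f(1 \mid p) = p$ and $f(0 \mid p) = 1 - p$, which recovers the product $p \cdot p \cdot (1 - p) \cdot p$ above.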

The last step in understanding the setup of the likelihood function is to recognize that until we observe data such as $\text{H, H, T, H}$, we might as well treat these observations as random variables, $X_1, X_2, X_3, X_4$. In this case, the functional form is

$$f(X_1 \mid p) \cdot f(X_2 \mid p) \cdot f(X_3 \mid p) \cdot f(X_4 \mid p)$$

The discussion above captures the intuition behind the setup of the likelihood function. From here, the main differences are notational, along with the conceptual step of treating this product as a function of the unknown parameter $p$.

To get from $f(X_1 \mid p) \cdot f(X_2 \mid p) \cdot f(X_3 \mid p) \cdot f(X_4 \mid p)$ to the general definition of the likelihood function, we generalize the unknown parameter $p$ to $\theta$, since this method should apply to any distribution's density function. Further, we use product notation, which is analogous to summation notation, to generalize to an arbitrary number of random variables $N$:

$$\prod_{n=1}^N f(X_n \mid \theta)$$

Once we have $N$ observations, our collection of random variables $\{X_n\}$ is bound to specific values $\{x_n\}$. On the other hand, the unknown parameter $\theta$ is not specified. The conceptual jump of the likelihood function is to treat the form

$$L(\theta \mid \mathbf{X}) = \prod_{n=1}^N f(X_n \mid \theta)$$

as a function of the unknown parameter $\theta$. We name the likelihood function $L$ and think of it as a function of the unknown parameter(s) $\theta$ given a fixed set of data $\mathbf{X} = (X_1, \ldots, X_N)$. The specific value of $\theta$ that maximizes the likelihood function is the best guess of the unknown parameter.

In an attempt to bring the general likelihood function back down to earth, consider the following plot depicting the scenario introduced above: the observations $\text{H, H, T, H}$ from a Bernoulli distribution with unknown parameter $p$. From exactly these four observations, the argument that maximizes the likelihood function is $\hat{p} = 0.75$.

[Plot: the likelihood $L(p \mid \mathbf{X})$ as a function of $p$, with its maximum marked at $\hat{p} = 0.75$.]
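The same maximum can be checked numerically. Below is a minimal sketch (assuming numpy is available; the variable names are illustrative, not from the text) that evaluates the Bernoulli likelihood for the observations $\text{H, H, T, H}$ over a grid of candidate values of $p$:

```python
import numpy as np

# Observations H, H, T, H coded as 1s and 0s.
data = np.array([1, 1, 0, 1])

# Likelihood: product of Bernoulli densities f(x | p) = p^x * (1 - p)^(1 - x).
def likelihood(p, x):
    return np.prod(p ** x * (1 - p) ** (1 - x))

# Evaluate the likelihood on a fine grid of candidate p values.
grid = np.linspace(0.001, 0.999, 999)
L = np.array([likelihood(p, data) for p in grid])

print(grid[np.argmax(L)])  # approximately 0.75
```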

Intuition (Normal)

Consider three data points from a Normal distribution with parameter values that I'm keeping secret:

$$x_1 = 2, \quad x_2 = -0.5, \quad x_3 = 0.3$$

Your job is to guess the values of $\mu$ and $\sigma$ that I have in mind.

With only three data points, displayed on the plot below as empty circles, you are unlikely to guess exactly the values of $(\mu, \sigma)$ that I have in mind, but you can still form a guess. The point of this page is that a good guess will be the values of $(\mu, \sigma)$ that maximize the likelihood $L(\mu, \sigma \mid 2, -0.5, 0.3)$.

[Interactive plot: the three observations shown as empty circles with a $\text{Normal}(\mu = 0, \sigma = 1)$ density overlaid; at these parameter values the likelihood is $L(\mu, \sigma \mid \mathbf{X}) \approx 0.0072$.]
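A standard result (not derived on this page) is that for a Normal model the likelihood is maximized at the sample mean for $\mu$ and at the square root of the average squared deviation (dividing by $N$, not $N - 1$) for $\sigma$. A minimal numerical sketch, assuming numpy and scipy are available and using illustrative variable names:

```python
import numpy as np
from scipy import stats

data = np.array([2.0, -0.5, 0.3])

# Closed-form maximum likelihood estimates for a Normal model.
mu_hat = data.mean()                                 # 0.6
sigma_hat = np.sqrt(np.mean((data - mu_hat) ** 2))   # about 1.04

# Likelihood at the estimates, and at the plot's default (mu = 0, sigma = 1).
L_hat = np.prod(stats.norm.pdf(data, loc=mu_hat, scale=sigma_hat))
L_default = np.prod(stats.norm.pdf(data, loc=0.0, scale=1.0))

print(mu_hat, sigma_hat)   # 0.6, ~1.04
print(L_hat, L_default)    # ~0.0125 versus ~0.0072
```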

Example

The last way we'll demonstrate the maximum likelihood method is by walking through an example. Suppose you have a sample of $N$ observations $X_1, \ldots, X_N$, all randomly sampled from the same distribution. We'll assume we know that the Rayleigh distribution generated our data, but that we don't know the parameter $\sigma$. We seek to estimate $\sigma$ from the data. The density function of the Rayleigh distribution is

$$f(x \mid \sigma) = \frac{x}{\sigma^2} \exp\left\{ -x^2 / (2\sigma^2) \right\}$$

for $x \in [0, \infty)$ and $\sigma > 0$.

To find the maximum likelihood estimate of $\sigma$, start by writing out the likelihood function:

$$L(\sigma \mid \mathbf{X}) = \prod_{n=1}^N \frac{X_n}{\sigma^2} \exp\left\{ -X_n^2 / (2\sigma^2) \right\}$$

The goal is to find the value of $\sigma$ that maximizes the likelihood function $L(\sigma \mid \mathbf{X})$.

Both humans and computers have difficulty working with products and exponents of functions. Therefore, it is common to take the natural log of the likelihood function. Because the log is a monotone increasing function, the value of $\sigma$ that maximizes the log of the likelihood also maximizes the likelihood itself. This is so common that the log of the likelihood function has its own name, the log-likelihood function. The log-likelihood function is written as

$$\ell(\sigma \mid \mathbf{X}) = \log L(\sigma \mid \mathbf{X}) = \sum_{n=1}^N \log f(X_n \mid \sigma) = \sum_{n=1}^N \left( \log X_n - 2\log\sigma - X_n^2 / (2\sigma^2) \right)$$

where we've used properties of $\log$ to turn the product into a sum.

[Plot: the log-likelihood $\ell(\sigma \mid \mathbf{X})$ as a function of $\sigma$.]
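A curve like this can be produced numerically. The sketch below (assuming numpy; the simulated dataset and variable names are illustrative) evaluates the Rayleigh log-likelihood on a grid of candidate $\sigma$ values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.rayleigh(scale=2.0, size=50)   # simulated data with true sigma = 2

# Rayleigh log-likelihood, summed over the observations.
def log_likelihood(sigma, x):
    return np.sum(np.log(x) - 2 * np.log(sigma) - x**2 / (2 * sigma**2))

# Evaluate on a grid of candidate sigma values and locate the maximum.
grid = np.linspace(0.5, 4.0, 701)
ll = np.array([log_likelihood(s, x) for s in grid])

print(grid[np.argmax(ll)])   # close to the true sigma of 2
```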

Recall from calculus that we can find local maxima/minima by differentiating a function, setting the derivative equal to zero, and solving for the variable of interest. In this scenario, the variable of interest is the unknown parameter, $\sigma$.

Often it's helpful to simplify the log-likelihood function to aid differentiation. In this case, the most helpful simplification is to notice that the first term within the sum, $\log X_n$, is constant with respect to $\sigma$, so it can be dropped:

$$\ell(\sigma \mid \mathbf{X}) \propto \sum_{n=1}^N \left( -2\log\sigma - X_n^2 / (2\sigma^2) \right)$$

The symbol $\propto$ (propto, short for "proportional to") indicates that the log-likelihood function for the Rayleigh distribution, $\ell(\sigma \mid \mathbf{X})$, equals the term on the right up to a term that does not depend on $\sigma$, and so does not affect where the maximum occurs.

To find the maximum of $\ell$, we'll take the derivative with respect to $\sigma$:

$$\frac{d\ell}{d\sigma} = \frac{-2N}{\sigma} + \frac{1}{\sigma^3} \sum_{n=1}^N X_n^2$$
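Term by term, each summand contributes two standard derivatives,

$$\frac{d}{d\sigma}\left( -2\log\sigma \right) = -\frac{2}{\sigma}, \qquad \frac{d}{d\sigma}\left( -\frac{X_n^2}{2\sigma^2} \right) = \frac{X_n^2}{\sigma^3},$$

and summing the first over $n = 1, \ldots, N$ produces the $-2N/\sigma$ term above.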

Next, set the derivative equal to zero and solve for $\sigma$. Multiplying both sides of

$$\frac{-2N}{\sigma} + \frac{1}{\sigma^3} \sum_{n=1}^N X_n^2 = 0$$

by $\sigma^3$ and rearranging gives

$$2N\sigma^2 = \sum_{n=1}^N X_n^2$$

Manipulate the expression until you find a solution for the parameter of interest. At this point, we put a hat over the parameter to recognize that it is our best guess of the unknown parameter based on the random variables $\mathbf{X}$:

$$\hat{\sigma} = \sqrt{\frac{1}{2N} \sum_{n=1}^N X_n^2}$$

The maximum likelihood estimator $\hat{\sigma}$ is the final solution. With data from a Rayleigh distribution, this solution tells you how to best estimate the unknown parameter $\sigma$.
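As a quick sanity check, here is a minimal sketch (assuming numpy; the simulated dataset and names are illustrative) that applies this estimator to data drawn from a Rayleigh distribution with a known $\sigma$ and compares the estimate to the truth:

```python
import numpy as np

rng = np.random.default_rng(42)
true_sigma = 1.5
x = rng.rayleigh(scale=true_sigma, size=10_000)

# Closed-form maximum likelihood estimate for the Rayleigh parameter.
sigma_hat = np.sqrt(np.sum(x**2) / (2 * len(x)))

print(true_sigma, sigma_hat)   # the estimate should be close to 1.5
```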



This work is licensed under the Creative Commons Attribution 4.0 International License.