Likelihood

Edward A. Roualdes

Contents

Introduction
Intuition
Maximization

Introduction

The likelihood method estimates population parameters using an assumed distribution function and a randomly sampled dataset from a (singular) population of interest. To estimate the parameters of the population of interest, standard methods of calculus are employed (Apex Calculus I is a great reference if you need an Open Educational Resource, i.e. a free one, to review derivatives, the calculus behind maximization and minimization). The goal is to find the most likely values of the population parameters given a particular dataset. As such, the best guess of the parameters derived from this method is called the maximum likelihood estimate.

The logic underlying the likelihood method goes like this. Set up the likelihood function. The maximum likelihood estimate is the argument of the likelihood function that maximizes the function. Often, this is simply written as \[ \hat{\theta} = \text{argmax}_{\theta} L( \theta | \mathbf{X}) \] to denote that the best guess is the maximal argument to the likelihood function given the data \( \mathbf{X} \). (Don't let the new notation \( \text{argmax}_x f(x) \) stand in your way. Consider an example that you can reason about somewhat easily: what is the argument \( x \) that maximizes the function \( f(x) = -x^2 \)?) The calculus is then left to the practitioner. These notes aim to provide a short introduction to the intuition behind the likelihood function's setup and to show the most common analytical strategy for finding the maximum likelihood estimates.
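To make the \( \text{argmax} \) notation concrete, here is a minimal Python sketch (Python, NumPy, and the grid bounds are my choices, not part of these notes) that approximates \( \text{argmax}_x (-x^2) \) by evaluating the function on a dense grid:

import numpy as np

# Evaluate f(x) = -x^2 on a dense grid and report the maximizing argument.
x = np.linspace(-3, 3, 10001)
f = -x**2

x_hat = x[np.argmax(f)]  # argmax returns the index of the largest value of f
print(x_hat)             # 0.0, the argument that maximizes f

Note that \( \text{argmax} \) returns the maximizing argument \( x = 0 \), not the maximum value \( f(0) = 0 \); the two happen to coincide for this particular function.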

Intuition

The likelihood function is defined relative to the density function \( f \) of the distribution that is assumed to fit the population of interest. The likelihood is defined to be the product of the density function evaluated at each observation in the sample. We think of the likelihood function as a function of the parameter(s) of interest, here generalized to \( \theta \), given the random variables \( \mathbf{X} \). \[ L(\theta | \mathbf{X}) = \prod_{n = 1}^N f(X_n | \theta) \]

The intuition behind the product of density functions goes like this. Imagine you have \( 4 \) observations from a fair coin, H, H, T, H. The probability associated with this event is \[ \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{16}. \]

Now, imagine that you don't know that the coin is fair; instead, the probability of heads is just \( p \). The probability is rewritten as \[ p \cdot p \cdot (1 - p) \cdot p. \] (You'd be on the right track if you're imagining that in four flips, three heads and one tail might suggest that \( p = 0.75 \).)

Next, write this probability using the density function of a Bernoulli distribution (see the notes on the Bernoulli distribution if you need a quick refresher). Since we map heads to \( 1 \) and tails to \( 0 \), we have \[ f(1 | p) \cdot f(1 | p) \cdot f(0 | p) \cdot f(1 | p). \]

The last step in understanding the setup of the likelihood function is to recognize that until we observe, say, H, H, T, H, we might as well treat these observations as random variables, \( X_1, X_2, X_3, X_4 \). In this case the functional form is \[ f(X_1 | p) \cdot f(X_2 | p) \cdot f(X_3 | p) \cdot f(X_4 | p). \]
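As a quick check of this setup, here is a small Python sketch (the function names and the use of Python are my own, hypothetical choices) that evaluates the product of Bernoulli densities for the sample H, H, T, H:

# Bernoulli density: f(x | p) = p^x * (1 - p)^(1 - x) for x in {0, 1}
def bernoulli_density(x, p):
    return p**x * (1 - p)**(1 - x)

# Likelihood of the sample H, H, T, H, mapped to 1, 1, 0, 1
def likelihood(p, data=(1, 1, 0, 1)):
    total = 1.0
    for x in data:
        total *= bernoulli_density(x, p)
    return total

print(likelihood(0.5))   # 0.0625, the fair-coin probability (1/2)^4 from above
print(likelihood(0.75))  # about 0.1055, a higher likelihood than p = 0.5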

The discussion above captures the intuition behind the setup of the likelihood function. From here, the remaining steps are notational generalization and a conceptual understanding of how we can treat this product as a function of the unknown parameter \( p \).

To get from \[ f(X_1 | p) \cdot f(X_2 | p) \cdot f(X_3 | p) \cdot f(X_4 | p) \] to the formal definition of the likelihood function, we generalize the unknown parameter \( p \) to \( \theta \), thinking that this method should apply to any distribution's density function, and we use product notation, which is analogous to summation notation, to expand the sample to any size \( N \): \[ \prod_{n = 1}^N f(X_n | \theta). \]

Once we have \( N \) observations, our sample of random variables is bound to specific values. On the other hand, the unknown parameter \( \theta \) is not specified. The conceptual jump of the likelihood function is to treat the form \[ \prod_{n = 1}^N f(X_n | \theta ) \] as a function of the unknown parameter \( \theta \).

The notation \( L(\theta | \mathbf{X}) \) implies that the likelihood function maps the combination of data \( \mathbf{X} \) and density function \( f \) to a unique value of the parameter \( \theta \) (if a likelihood function maps one sample \( \mathbf{X} \) to more than one value of \( \theta \), we call the parameter \( \theta \) unidentifiable). The specific value of \( \theta \) that maximizes the likelihood function is the best guess of the unknown population parameter. The value \( \hat{\theta} \) is called the maximum likelihood estimate of \( \theta \).

To bring the general likelihood function back down to earth, consider plotting the likelihood function for the scenario introduced above, the observations H, H, T, H from a Bernoulli distribution with unknown population parameter \( p \), across values of \( p \in [0, 1] \). From exactly these four observations, the argument that maximizes the likelihood function is \( \hat{p} = 0.75 \).
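In place of the plot, a minimal grid-search sketch (assuming Python with NumPy, neither of which appears in the original notes) recovers the same maximizer:

import numpy as np

# Likelihood of H, H, T, H as a function of p, evaluated on a fine grid over [0, 1]
p = np.linspace(0, 1, 10001)
L = p * p * (1 - p) * p

p_hat = p[np.argmax(L)]
print(p_hat)  # 0.75, matching the intuition of three heads in four flips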

Maximization

The best way to demonstrate maximizing a likelihood function is to walk through an example. Suppose you have a sample of \( N \) observations \( X_1, \ldots, X_N \) all randomly sampled from the same population. We'll assume the population follows the Rayleigh distribution with unknown parameter \( \sigma \), to be estimated from the data. The density function of the Rayleigh distribution is \[ f(x | \sigma ) = \frac{x}{\sigma^2} e^{-x^2 / (2 \sigma^2)} \] for \( x \in [0, \infty) \) and \( \sigma > 0 \).

To find the maximum likelihood estimate, start by writing out the likelihood function. \[ \begin{aligned} L( \sigma | \mathbf{X} ) & = \prod_{n = 1}^N f(X_n | \sigma) \\ & = \prod_{n = 1}^N \frac{X_n}{\sigma^2} e^{-X_n^2 / (2 \sigma^2)} \end{aligned} \]

The goal is to find the value of \( \sigma \) that maximizes the likelihood function \( L( \sigma | \mathbf{X} ) \). Both humans and computers have difficulty working with products and exponents of functions. Therefore, it is common to take the natural log of the likelihood function. Because the natural log is a strictly increasing function, the value of \( \sigma \) that maximizes the log of the likelihood also maximizes the likelihood itself. This is so common that the log of the likelihood function is often just referred to as the log-likelihood function. We'll denote this function \( \ell(\sigma | \mathbf{X}) = \log{L(\sigma | \mathbf{X} ) } \).
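The computational difficulty is easy to demonstrate. In the following sketch (a hypothetical example of mine: Python with NumPy, simulated Rayleigh data, and a sample size of 10,000), the product of many density values underflows to zero in floating point, while the sum of their logs remains perfectly workable:

import numpy as np

rng = np.random.default_rng(0)
x = rng.rayleigh(scale=2.0, size=10000)  # simulated sample; scale plays the role of sigma
sigma = 2.0

density = (x / sigma**2) * np.exp(-x**2 / (2 * sigma**2))
print(np.prod(density))         # 0.0: the product of thousands of values below 1 underflows
print(np.sum(np.log(density)))  # a finite log-likelihood; sums of logs avoid underflow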

Recall from calculus that we can find local maxima/minima by differentiating a function, setting the derivative equal to zero, and solving for the variable of interest. In this scenario, the variable of interest is the unknown population parameter.

Often it's helpful to simplify the log-likelihood function to aid differentiation. The simplified log-likelihood is \[ \begin{aligned} \ell(\sigma | \mathbf{X}) & = \sum_{n = 1}^N \log \left\{ \frac{X_n}{\sigma^2} e^{-X_n^2 / (2 \sigma^2)} \right\} \\ & = \sum_{n = 1}^N \left\{ \log{X_n} - 2 \log{\sigma} - X_n^2 / (2\sigma^2) \right\} \\ & = \sum_{n = 1}^N \log{X_n} - 2N\log{\sigma} - \frac{1}{2\sigma^2}\sum_{n = 1}^N X_n^2 \\ \end{aligned} \]
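If you want to double-check the algebra, a short numerical comparison (again a hypothetical Python/NumPy sketch with simulated data, not part of the original notes) confirms that the simplified form agrees with a direct sum of log densities:

import numpy as np

rng = np.random.default_rng(0)
x = rng.rayleigh(scale=2.0, size=100)  # simulated sample
N, sigma = len(x), 1.5                 # an arbitrary candidate value of sigma

# Direct evaluation: sum the log of each density value
direct = np.sum(np.log(x / sigma**2) - x**2 / (2 * sigma**2))

# Simplified form from the derivation above
simplified = np.sum(np.log(x)) - 2 * N * np.log(sigma) - np.sum(x**2) / (2 * sigma**2)

print(np.isclose(direct, simplified))  # True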

Proceed by taking the derivative of the simplified log-likelihood with respect to the unknown population parameter. \[ \frac{d \ell}{d \sigma} = -\frac{2N}{\sigma} + \frac{1}{\sigma^3} \sum_{n = 1}^N X_n^2\]
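Derivatives like this one are easy to verify numerically. The sketch below (the same hypothetical Python/NumPy setup as above) compares the analytic derivative against a central finite difference:

import numpy as np

rng = np.random.default_rng(0)
x = rng.rayleigh(scale=2.0, size=100)
N, sigma, h = len(x), 1.5, 1e-6

def log_lik(s):
    return np.sum(np.log(x)) - 2 * N * np.log(s) - np.sum(x**2) / (2 * s**2)

analytic = -2 * N / sigma + np.sum(x**2) / sigma**3
numeric = (log_lik(sigma + h) - log_lik(sigma - h)) / (2 * h)  # central difference
print(np.isclose(analytic, numeric))  # True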

Set the derivative equal to zero and solve for the unknown population parameter. \[ \frac{2N}{\sigma} = \frac{1}{\sigma^3} \sum_{n=1}^N X_n^2 \]

Collecting \( \sigma \)s on the left-hand side yields \[ 2N\sigma^2 = \sum_{n=1}^N X_n^2. \]

Manipulate the expression until you find a solution for the parameter of interest. At this point, we put a hat over the parameter to recognize that it is our best guess of the parameter of interest. \[ \hat{\sigma} = \sqrt{\frac{1}{2N} \sum_{n = 1}^N X_n^2} \]

The maximum likelihood estimate \( \hat{\sigma} \) is the final solution. With data from a population assumed to follow the Rayleigh distribution, this is the estimate for the population parameter \( \sigma \).
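To close the loop, the closed-form estimate can be checked against a direct numerical maximization. The following sketch (hypothetical: Python with NumPy and SciPy, simulated data with a true \( \sigma \) of 2.0) computes \( \hat{\sigma} \) both ways:

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.rayleigh(scale=2.0, size=1000)  # simulated sample; true sigma = 2.0
N = len(x)

# Closed-form maximum likelihood estimate derived above
sigma_hat = np.sqrt(np.sum(x**2) / (2 * N))

# Numerical check: minimize the negative log-likelihood over sigma > 0
def neg_log_lik(sigma):
    return -(np.sum(np.log(x)) - 2 * N * np.log(sigma) - np.sum(x**2) / (2 * sigma**2))

result = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print(sigma_hat, result.x)  # both close to 2.0, and essentially equal to each other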


Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International