Conditional Distributions (measure theory)
Suppose $(\Omega, \Sigma_\Omega)$ is a measurable space equipped with a (probability) measure $\mu$. Data are defined by the values the measurable map $X(\omega)$ takes on. The data space relative to $X : \Omega \to \mathcal{X}$ is $(\mathcal{X}, \Sigma_X, \mu_X)$, where $\mu_X$ is the pushforward of $\mu$ by $X$, defined as $\mu_X := \mu \circ X^{-1}$. Similarly, the parameter space $(\Theta, \Sigma_\Theta, \mu_\theta)$ is defined relative to the measurable map $\theta : \Omega \to \Theta$.
A conditional (probability) distribution of $X$ given $\theta$ is a function $\mu_{X \mid \theta} : \Theta \times \Sigma_X \to [0, 1]$ such that
- for fixed $A_X \in \Sigma_X$, the map $\vartheta \mapsto \mu_{X \mid \theta}(\vartheta, A_X)$ is measurable in $\vartheta \in \Theta$, and
- for fixed $\vartheta \in \Theta$, the map $A_X \mapsto \mu_{X \mid \theta}(\vartheta, A_X)$ is a probability measure on $(\mathcal{X}, \Sigma_X)$.
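As a concrete illustration (a hypothetical example, not from the source), take $\Theta = \mathcal{X} = \mathbb{R}$ and let $\mu_{X \mid \theta}(\vartheta, \cdot)$ be the $N(\vartheta, 1)$ law. The two defining properties can then be spot-checked numerically; the helper names below (`Phi`, `mu_X_given_theta`) are invented for this sketch.

```python
# Sketch: the Gaussian family N(theta, 1) viewed as a conditional
# distribution mu_{X|theta}(theta, A), using only the standard library.
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def mu_X_given_theta(theta, a, b):
    """N(theta, 1) probability of the interval A = (a, b)."""
    return Phi(b - theta) - Phi(a - theta)

# Property 1: with A = (-1, 1) fixed, this is a function of theta.
probs = [mu_X_given_theta(t, -1.0, 1.0) for t in (-2.0, 0.0, 2.0)]

# Property 2: with theta = 0 fixed, it behaves like a probability
# measure: total mass 1 and additivity over disjoint intervals.
total = mu_X_given_theta(0.0, -40.0, 40.0)
additive = (mu_X_given_theta(0.0, -1.0, 0.0)
            + mu_X_given_theta(0.0, 0.0, 1.0))
```

(Measurability in $\vartheta$ cannot be verified by evaluation, of course; the point is only that the same two-argument function is a function of $\vartheta$ for fixed $A_X$ and a measure for fixed $\vartheta$.)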
The existence of such a function is proved in Theorem 33.3 of Billingsley (2012). We list some properties of such a function next.
If there exists a measure $\lambda$ on $(\mathcal{X}, \Sigma_X)$ such that $\mu_{X \mid \theta}(\vartheta, \cdot) \ll \lambda$ for all $\vartheta \in \Theta$, then the Radon-Nikodym derivative (density function) of $\mu_{X \mid \theta}$ exists and is written as $d\mu_{X \mid \theta}/d\lambda = \rho_{X \mid \theta}$. In other words,

$$\mu_{X \mid \theta}(\vartheta, A_X) = \int_{A_X} \rho_{X \mid \theta}(x) \, d\lambda(x)$$

for every $\vartheta \in \Theta$ and for all $A_X \in \Sigma_X$.
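Continuing the hypothetical Gaussian example, one may take $\lambda$ to be Lebesgue measure and $\rho_{X \mid \theta}$ the $N(\vartheta, 1)$ density; the display above can then be checked by quadrature. The `integrate` helper below is a plain trapezoidal rule written for this sketch.

```python
# Check mu_{X|theta}(theta, A) = integral of rho_{X|theta} over A
# with respect to lambda = Lebesgue measure, for the N(theta, 1) family.
from math import erf, exp, pi, sqrt

def rho_X_given_theta(x, theta):
    """N(theta, 1) density with respect to Lebesgue measure."""
    return exp(-0.5 * (x - theta) ** 2) / sqrt(2.0 * pi)

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def integrate(f, a, b, n=10_000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

theta, a, b = 0.5, -1.0, 1.0
kernel_value = Phi(b - theta) - Phi(a - theta)   # mu_{X|theta}(theta, A)
density_integral = integrate(lambda x: rho_X_given_theta(x, theta), a, b)
```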
The joint measure $\mu_{X, \theta}$ is defined as

$$\mu_{X, \theta}(A_X \times B_\Theta) = \int_{B_\Theta} \mu_{X \mid \theta}(\vartheta, A_X) \, d\mu_\theta(\vartheta).$$

Replacing the conditional distribution with its density function gives

$$\mu_{X, \theta}(A_X \times B_\Theta) = \int_{B_\Theta} \int_{A_X} \rho_{X \mid \theta}(x) \, d\lambda(x) \, d\mu_\theta(\vartheta).$$

By Tonelli's theorem, the version of Fubini's theorem for non-negative measurable maps, we can switch the order of integration:

$$\mu_{X, \theta}(A_X \times B_\Theta) = \int_{A_X} \int_{B_\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta) \, d\lambda(x). \tag{1}$$
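The interchange of integrals can be spot-checked numerically in a hypothetical model, say $\theta \sim N(0, 1)$ with density $\rho_\theta$ and $X \mid \theta \sim N(\vartheta, 1)$; here $d\mu_\theta(\vartheta) = \rho_\theta(\vartheta)\, d\vartheta$, so both sides become iterated Lebesgue integrals of a non-negative map. All helper names are invented for this sketch.

```python
# Tonelli spot-check: the joint mass mu_{X,theta}(A_X x B_Theta) computed
# with either order of integration agrees.  Model: theta ~ N(0, 1),
# X | theta ~ N(theta, 1); A_X = (-1, 1), B_Theta = (0, 2).
from math import exp, pi, sqrt

def npdf(x, mean=0.0):
    """N(mean, 1) density."""
    return exp(-0.5 * (x - mean) ** 2) / sqrt(2.0 * pi)

def integrate(f, a, b, n=400):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

aX, bX = -1.0, 1.0
aT, bT = 0.0, 2.0

# Integrate over A_X first, then B_Theta ...
x_then_t = integrate(
    lambda t: integrate(lambda x: npdf(x, t), aX, bX) * npdf(t), aT, bT)
# ... and in the opposite order.
t_then_x = integrate(
    lambda x: integrate(lambda t: npdf(x, t) * npdf(t), aT, bT), aX, bX)
```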
The marginal distribution $\mu_X$ can be recovered by taking $B_\Theta = \Theta$:

$$\mu_X(A_X) = \int_{A_X} \int_{\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta) \, d\lambda(x).$$

This identifies a version of the Radon-Nikodym derivative of the marginal distribution:

$$\frac{d\mu_X}{d\lambda} = \int_{\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta).$$
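In the same hypothetical Normal-Normal model ($\theta \sim N(0, 1)$, $X \mid \theta \sim N(\vartheta, 1)$) the marginal of $X$ is known in closed form to be $N(0, 2)$, which gives a way to sanity-check the display above numerically. The helpers below are written for this sketch.

```python
# Marginal density of X at a point:  d(mu_X)/d(lambda)(x) equals the
# prior-weighted integral of the conditional density.  In the
# Normal-Normal model this should match the N(0, 2) density.
from math import exp, pi, sqrt

def npdf(z, mean=0.0, var=1.0):
    """N(mean, var) density."""
    return exp(-0.5 * (z - mean) ** 2 / var) / sqrt(2.0 * pi * var)

def integrate(f, a, b, n=4000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

x = 0.7
# Truncate the theta-integral to [-40, 40]; the tails are negligible.
marginal_numeric = integrate(lambda t: npdf(x, mean=t) * npdf(t), -40.0, 40.0)
marginal_exact = npdf(x, mean=0.0, var=2.0)
```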
Similarly, there exists a conditional distribution of $\theta$ given $X$, $\mu_{\theta \mid X} : \mathcal{X} \times \Sigma_\Theta \to [0, 1]$. The joint distribution is recovered as

$$\mu_{X, \theta}(A_X \times B_\Theta) = \int_{A_X} \mu_{\theta \mid X}(x, B_\Theta) \, d\mu_X(x). \tag{2}$$
Bayes' Theorem
Building on the context above, the conditional distribution of $\theta$ given $X$ can be written as

$$\mu_{\theta \mid X}(x, B_\Theta) = \int_{B_\Theta} \frac{\rho_{X \mid \theta}(x)}{\int_{\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta)} \, d\mu_\theta(\vartheta),$$

which means the density function is

$$\frac{d\mu_{\theta \mid X}}{d\mu_\theta} = \frac{\rho_{X \mid \theta}}{\int_{\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta)}.$$

Below we sketch a proof of Bayes' theorem following Schervish (2012).
Notice that there are two ways to represent the joint distribution $\mu_{X, \theta}$. By (1), we have

$$\mu_{X, \theta}(A_X \times B_\Theta) = \int_{A_X} \int_{B_\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta) \, d\lambda(x).$$

By (2), together with the expression for $d\mu_X / d\lambda$ above, we have

$$\mu_{X, \theta}(A_X \times B_\Theta) = \int_{A_X} \mu_{\theta \mid X}(x, B_\Theta) \, d\mu_X(x) = \int_{A_X} \mu_{\theta \mid X}(x, B_\Theta) \int_{\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta) \, d\lambda(x).$$
Since the last two displays agree for every $A_X \in \Sigma_X$, the integrands (over $A_X$) must be equal $\lambda$-almost everywhere:

$$\mu_{\theta \mid X}(x, B_\Theta) \int_{\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta) = \int_{B_\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta).$$
Rearranging, we find an expression for the conditional distribution of $\theta$ given $X$:

$$\mu_{\theta \mid X}(x, B_\Theta) = \int_{B_\Theta} \frac{\rho_{X \mid \theta}(x)}{\int_{\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta)} \, d\mu_\theta(\vartheta).$$
The density function with respect to $\mu_\theta$ is then

$$\frac{d\mu_{\theta \mid X}}{d\mu_\theta} = \frac{\rho_{X \mid \theta}}{\int_{\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta)}.$$
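In the hypothetical Normal-Normal model ($\theta \sim N(0, 1)$, $X \mid \theta \sim N(\vartheta, 1)$) the posterior of $\theta$ given $X = x$ is known in closed form to be $N(x/2, 1/2)$, so the display above can be verified numerically. This is a sketch under those assumptions, with helper names invented here.

```python
# Posterior probability of B_Theta = (0, 1) given X = x, two ways:
#   (i)  integrate the density d(mu_{theta|X})/d(mu_theta) against the
#        prior over B_Theta, as in the formula above;
#   (ii) use the closed-form conjugate posterior N(x/2, 1/2).
from math import erf, exp, pi, sqrt

def npdf(z, mean=0.0, var=1.0):
    """N(mean, var) density."""
    return exp(-0.5 * (z - mean) ** 2 / var) / sqrt(2.0 * pi * var)

def ncdf(z, mean=0.0, var=1.0):
    """N(mean, var) CDF."""
    return 0.5 * (1.0 + erf((z - mean) / sqrt(2.0 * var)))

def integrate(f, a, b, n=4000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

x = 1.3
# Numerator: likelihood-weighted prior mass of B_Theta = (0, 1).
numerator = integrate(lambda t: npdf(x, mean=t) * npdf(t), 0.0, 1.0)
# Denominator: the marginal density of X (tails beyond [-40, 40] negligible).
denominator = integrate(lambda t: npdf(x, mean=t) * npdf(t), -40.0, 40.0)
posterior_numeric = numerator / denominator
posterior_exact = ncdf(1.0, x / 2, 0.5) - ncdf(0.0, x / 2, 0.5)
```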
If we further assume that there exists a measure $\lambda$ on $(\Theta, \Sigma_\Theta)$ (overloading the symbol $\lambda$; this measure is distinct from the one on $\mathcal{X}$) such that $\mu_\theta \ll \lambda$, then there exists a version of the Radon-Nikodym derivative that recovers what is commonly taught as Bayes' theorem to undergraduates, written in terms of density functions.
By the chain rule for Radon-Nikodym derivatives,

$$\rho_{\theta \mid X} = \frac{d\mu_{\theta \mid X}}{d\lambda} = \frac{d\mu_{\theta \mid X}}{d\mu_\theta} \frac{d\mu_\theta}{d\lambda}.$$

Next, expand the Radon-Nikodym derivatives:

$$\rho_{\theta \mid X} = \frac{d\mu_{\theta \mid X}}{d\mu_\theta} \frac{d\mu_\theta}{d\lambda} = \frac{\rho_{X \mid \theta} \, \rho_\theta}{\int_{\Theta} \rho_{X \mid \theta}(x) \, d\mu_\theta(\vartheta)}.$$
Last, expand the term $d\mu_\theta$ inside the integral:

$$\rho_{\theta \mid X} = \frac{\rho_{X \mid \theta} \, \rho_\theta}{\int_{\Theta} \rho_{X \mid \theta}(x) \, \rho_\theta(\vartheta) \, d\lambda(\vartheta)}.$$

References
- Billingsley, P. (2012). Probability and measure. John Wiley & Sons.
- Schervish, M. J. (2012). Theory of statistics. Springer Science & Business Media.