Conditional Distributions (measure theory)

Conditional Distributions

Suppose $(\Omega, \Sigma_{\Omega})$ is a measurable metric space with (probability) measure $\mu$. Data are defined by the values the measurable map $X(\omega)$ takes on. The data space relative to $X \colon \Omega \to \mathscr{X}$ is $(\mathscr{X}, \Sigma_{\mathscr{X}}, \mu_{X})$, where $\mu_{X}$ is the pushforward of $\mu$ by $X$, defined by $\mu_{X} \coloneqq \mu \circ X^{-1}$. Similarly, the parameter space $(\Theta, \Sigma_{\Theta}, \mu_{\theta})$ is defined relative to the measurable map $\theta \colon \Omega \to \Theta$.
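
For a quick example of a pushforward (a standard one, not taken from the references), let $\Omega = (0, 1)$ with $\mu$ the Lebesgue measure and $X(\omega) = -\log \omega$, so $\mathscr{X} = (0, \infty)$. For $0 \le a < b$,

$$\mu_{X}\big((a, b)\big) = \mu\big(X^{-1}((a, b))\big) = \mu\big((e^{-b}, e^{-a})\big) = e^{-a} - e^{-b},$$

which identifies $\mu_{X}$ as the $\mathrm{Exponential}(1)$ distribution.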

A conditional (probability) distribution of $X$ given $\theta$ is a function $\mu_{X|\theta} \colon \Theta \times \Sigma_{\mathscr{X}} \to [0, 1]$ such that

  • for each fixed $A_{\mathscr{X}} \in \Sigma_{\mathscr{X}}$, the map $\vartheta \mapsto \mu_{X|\theta}(\vartheta, A_{\mathscr{X}})$ is measurable in $\vartheta \in \Theta$, and
  • for each fixed $\vartheta \in \Theta$, the map $A_{\mathscr{X}} \mapsto \mu_{X|\theta}(\vartheta, A_{\mathscr{X}})$ is a measure on $(\mathscr{X}, \Sigma_{\mathscr{X}})$.
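
As a concrete illustration (a standard example, not taken from the references), let $\mathscr{X} = \Theta = \mathbb{R}$ with their Borel $\sigma$-algebras and set

$$\mu_{X|\theta}(\vartheta, A_{\mathscr{X}}) = \int_{A_{\mathscr{X}}} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\tfrac{1}{2}(x - \vartheta)^2\right) dx,$$

that is, $X$ given $\theta = \vartheta$ is $\mathrm{Normal}(\vartheta, 1)$. For fixed $A_{\mathscr{X}}$ the map $\vartheta \mapsto \mu_{X|\theta}(\vartheta, A_{\mathscr{X}})$ is continuous, hence measurable, and for fixed $\vartheta$ the set function $A_{\mathscr{X}} \mapsto \mu_{X|\theta}(\vartheta, A_{\mathscr{X}})$ is a probability measure, so both conditions hold.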

The existence of such a function is proved in Theorem 33.3 of Billingsley (2012). We list some of its properties next.

If there exists a measure $\lambda$ on $(\mathscr{X}, \Sigma_{\mathscr{X}})$ such that $\mu_{X|\theta}(\vartheta, \cdot) \ll \lambda$ for all $\vartheta \in \Theta$, then the Radon-Nikodym derivative (density function) of $\mu_{X|\theta}$ exists and is written $d\mu_{X|\theta} / d\lambda = \rho_{X|\theta}$; the density also depends on $\vartheta$, a dependence we suppress in the notation. In other words,

$$\mu_{X|\theta}(\vartheta, A_{\mathscr{X}}) = \int_{A_{\mathscr{X}}} \rho_{X|\theta}(x) \, d\lambda(x)$$

for every $\vartheta \in \Theta$ and every $A_{\mathscr{X}} \in \Sigma_{\mathscr{X}}$.
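
For instance (again a standard example rather than one from the text), the dominating measure need not be Lebesgue measure: with $\mathscr{X} = \{0, 1, 2, \dots\}$, $\Theta = (0, \infty)$, and $\lambda$ the counting measure on $\mathscr{X}$, the Poisson family

$$\rho_{X|\theta}(x) = e^{-\vartheta} \frac{\vartheta^{x}}{x!}$$

satisfies $\mu_{X|\theta}(\vartheta, A_{\mathscr{X}}) = \int_{A_{\mathscr{X}}} \rho_{X|\theta}(x) \, d\lambda(x) = \sum_{x \in A_{\mathscr{X}}} e^{-\vartheta} \vartheta^{x} / x!$, so $\mu_{X|\theta}(\vartheta, \cdot) \ll \lambda$ for every $\vartheta$.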

The joint measure $\mu_{X, \theta}$ is defined by

$$\mu_{X, \theta}(A_{\mathscr{X}} \times B_{\Theta}) = \int_{B_{\Theta}} \mu_{X|\theta}(\vartheta, A_{\mathscr{X}}) \, d\mu_{\theta}(\vartheta).$$

Replacing the conditional distribution with its density function gives

$$\mu_{X, \theta}(A_{\mathscr{X}} \times B_{\Theta}) = \int_{B_{\Theta}} \int_{A_{\mathscr{X}}} \rho_{X|\theta}(x) \, d\lambda(x) \, d\mu_{\theta}(\vartheta).$$

By Tonelli's theorem, the version of Fubini's theorem for non-negative measurable maps, we can switch the order of integration:

\begin{equation}
\mu_{X, \theta}(A_{\mathscr{X}} \times B_{\Theta}) = \int_{A_{\mathscr{X}}} \int_{B_{\Theta}} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta) \, d\lambda(x).
\end{equation}
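
As a quick worked example (a toy model chosen for illustration, not taken from the source), let $\mathscr{X} = \{0, 1\}$ with $\lambda$ the counting measure, let $\Theta = [0, 1]$ with $\mu_{\theta}$ the uniform distribution, and let $\rho_{X|\theta}(x) = \vartheta^{x}(1 - \vartheta)^{1 - x}$, a Bernoulli conditional. Taking $A_{\mathscr{X}} = \{1\}$ and $B_{\Theta} = [0, \tfrac{1}{2}]$,

$$\mu_{X, \theta}\big(\{1\} \times [0, \tfrac{1}{2}]\big) = \int_{0}^{1/2} \vartheta \, d\vartheta = \tfrac{1}{8},$$

and integrating in the other order, $\int_{\{1\}} \int_{0}^{1/2} \vartheta \, d\vartheta \, d\lambda(x) = \tfrac{1}{8}$ as well, as Tonelli's theorem guarantees.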

Taking $B_{\Theta} = \Theta$ in $(1)$, so that no restriction is placed on $\theta$, recovers the marginal distribution $\mu_{X}$:

$$\mu_{X}(A_{\mathscr{X}}) = \int_{A_{\mathscr{X}}} \int_{\Theta} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta) \, d\lambda(x).$$

This gives a version of the Radon-Nikodym derivative of the marginal distribution,

$$\frac{d\mu_{X}}{d\lambda} = \int_{\Theta} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta).$$
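
Continuing the illustrative Bernoulli example, the marginal of $X$ under the uniform prior is

$$\mu_{X}(\{1\}) = \int_{0}^{1} \vartheta \, d\vartheta = \tfrac{1}{2}, \qquad \frac{d\mu_{X}}{d\lambda}(x) = \int_{0}^{1} \vartheta^{x}(1 - \vartheta)^{1 - x} \, d\vartheta = \tfrac{1}{2} \quad \text{for } x \in \{0, 1\},$$

so the marginal is the uniform distribution on $\{0, 1\}$.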

Similarly, there exists a conditional distribution of $\theta$ given $X$, $\mu_{\theta|X} \colon \mathscr{X} \times \Sigma_{\Theta} \to [0, 1]$. The joint distribution is recovered as

\begin{equation}
\mu_{X, \theta}(A_{\mathscr{X}} \times B_{\Theta}) = \int_{A_{\mathscr{X}}} \mu_{\theta|X}(x, B_{\Theta}) \, d\mu_{X}(x).
\end{equation}

Bayes' Theorem

Building on the context above, the conditional distribution of $\theta$ given $X$ can be written as

$$\mu_{\theta|X}(x, B_{\Theta}) = \int_{B_{\Theta}} \frac{\rho_{X|\theta}(x)}{\int_{\Theta} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta)} \, d\mu_{\theta}(\vartheta),$$

which means the density function with respect to $\mu_{\theta}$ is

$$\frac{d\mu_{\theta|X}}{d\mu_{\theta}} = \frac{\rho_{X|\theta}}{\int_{\Theta} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta)}.$$
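
Continuing the illustrative Bernoulli example: after observing $x = 1$ under the uniform prior, the density of the posterior with respect to $\mu_{\theta}$ is

$$\frac{d\mu_{\theta|X}}{d\mu_{\theta}}(\vartheta) = \frac{\vartheta}{\int_{0}^{1} \vartheta' \, d\mu_{\theta}(\vartheta')} = 2\vartheta,$$

so, for example, $\mu_{\theta|X}(1, [\tfrac{1}{2}, 1]) = \int_{1/2}^{1} 2\vartheta \, d\vartheta = \tfrac{3}{4}$: the observation $x = 1$ shifts mass toward larger values of $\vartheta$.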

Below we sketch a proof of Bayes' Theorem following Schervish (2012).

Notice that there are two ways to represent the joint distribution $\mu_{X, \theta}$. By $(1)$, we have

$$\mu_{X, \theta}(A_{\mathscr{X}} \times B_{\Theta}) = \int_{A_{\mathscr{X}}} \int_{B_{\Theta}} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta) \, d\lambda(x).$$

By $(2)$, together with the density of the marginal $\mu_{X}$ found above, we have

$$\mu_{X, \theta}(A_{\mathscr{X}} \times B_{\Theta}) = \int_{A_{\mathscr{X}}} \mu_{\theta|X}(x, B_{\Theta}) \, d\mu_{X}(x) = \int_{A_{\mathscr{X}}} \mu_{\theta|X}(x, B_{\Theta}) \int_{\Theta} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta) \, d\lambda(x).$$

Since the last two displays agree for every $A_{\mathscr{X}} \in \Sigma_{\mathscr{X}}$, the integrands must be equal for $\lambda$-almost every $x$:

$$\mu_{\theta|X}(x, B_{\Theta}) \int_{\Theta} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta) = \int_{B_{\Theta}} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta).$$

Rearranging, we find an expression for the conditional distribution of $\theta$ given $X$:

$$\mu_{\theta|X}(x, B_{\Theta}) = \int_{B_{\Theta}} \frac{\rho_{X|\theta}(x)}{\int_{\Theta} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta)} \, d\mu_{\theta}(\vartheta).$$

The density function with respect to $\mu_{\theta}$ is then

$$\frac{d\mu_{\theta|X}}{d\mu_{\theta}} = \frac{\rho_{X|\theta}}{\int_{\Theta} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta)}.$$

If we further assume that there exists a measure $\nu$ on $(\Theta, \Sigma_{\Theta})$ such that $\mu_{\theta} \ll \nu$, with density $d\mu_{\theta}/d\nu = \rho_{\theta}$, then there exists a version of the Radon-Nikodym derivative that recovers what is commonly taught to undergraduates as Bayes' theorem, written in terms of density functions. Since $\mu_{\theta|X}(x, \cdot) \ll \mu_{\theta} \ll \nu$, the chain rule for Radon-Nikodym derivatives gives

$$\rho_{\theta|X} = \frac{d\mu_{\theta|X}}{d\nu} = \frac{d\mu_{\theta|X}}{d\mu_{\theta}} \frac{d\mu_{\theta}}{d\nu}.$$

Next expand the Radon-Nikodym derivatives:

$$\rho_{\theta|X} = \frac{d\mu_{\theta|X}}{d\mu_{\theta}} \frac{d\mu_{\theta}}{d\nu} = \frac{\rho_{X|\theta} \, \rho_{\theta}}{\int_{\Theta} \rho_{X|\theta}(x) \, d\mu_{\theta}(\vartheta)}.$$

Last, expand $d\mu_{\theta} = \rho_{\theta} \, d\nu$ inside the integral:

$$\rho_{\theta|X} = \frac{\rho_{X|\theta} \, \rho_{\theta}}{\int_{\Theta} \rho_{X|\theta}(x) \, \rho_{\theta}(\vartheta) \, d\nu(\vartheta)}.$$
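
To make the final density formula concrete, here is a minimal numerical sketch: it evaluates $\rho_{\theta|X}$ on a grid for a Beta prior with a Bernoulli likelihood and compares the result with the known conjugate posterior. The model, prior parameters, and grid size are my own illustrative assumptions, not taken from the text.

```python
# A minimal numerical sketch of the final density formula
#   rho_{theta|X} = rho_{X|theta} * rho_theta / int_Theta rho_{X|theta} rho_theta dnu,
# checked against the closed-form conjugate posterior of a Beta-Bernoulli model.
# Assumed for illustration only: Beta(2, 3) prior, observation x = 1, grid of 10,001 points.
import numpy as np
from scipy import stats

a, b, x_obs = 2.0, 3.0, 1                    # assumed prior Beta(2, 3), observed x = 1
grid = np.linspace(1e-6, 1 - 1e-6, 10_001)   # discretization of Theta = (0, 1)
dnu = grid[1] - grid[0]                      # nu = Lebesgue measure on (0, 1)

prior = stats.beta.pdf(grid, a, b)                   # rho_theta = d mu_theta / d nu
likelihood = grid**x_obs * (1 - grid)**(1 - x_obs)   # rho_{X|theta}(x_obs), Bernoulli(theta)

evidence = np.sum(likelihood * prior) * dnu          # Riemann sum for the normalizing integral
posterior = likelihood * prior / evidence            # Bayes' theorem in density form

exact = stats.beta.pdf(grid, a + x_obs, b + 1 - x_obs)   # conjugate posterior Beta(3, 3)
print(np.max(np.abs(posterior - exact)))                 # small, up to discretization error
```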

References

  • Billingsley, P. (2012). Probability and measure. John Wiley & Sons.
  • Schervish, M. J. (2012). Theory of statistics. Springer Science & Business Media.