Abstract

This technical report analyzes two mathematically equivalent versions of Hamiltonian dynamics within the Hamiltonian Monte Carlo (HMC) algorithm as implemented in the probabilistic programming language Stan. This report compares Stan’s HMC dynamics to an alternative writing of the same dynamics, both under the dense Euclidean metric. While the proposed reframing of Hamiltonian dynamics reduces the number of calculations within the HMC algorithm, there is no tangible reduction in the run time of the overall sampler for the tested target distributions. This is likely due to the fact that gradient evaluations of the target distributions dominate the run time.

Introduction

This technical report analyzes two mathematically equivalent versions of Hamiltonian dynamics within the Hamiltonian Monte Carlo (HMC) algorithm as implemented in the probabilistic programming language Stan. This report compares Stan’s HMC dynamics to an alternative writing of the same dynamics, both under the dense Euclidean metric, but does not attempt to explain Hamiltonian Monte Carlo. For a thorough introduction to HMC, consider Michael Betancourt’s A Conceptual Introduction to Hamiltonian Monte Carlo and Radford Neal’s MCMC using Hamiltonian dynamics. While the proposed dynamics reduce the number of calculations within the HMC algorithm, there is no tangible reduction in the run time of the overall sampler for the tested target distributions. This is likely due to the fact that the majority of the HMC run time is dominated by the cost of leapfrog steps, which involve costly gradient evaluations of the target distribution.

This report first describes the dynamics as currently implemented in Stan v2.23.0, then describes the proposed dynamics. A simulation study is carried out in the section Numerical Experiments, and a short discussion of the results and of potential future research concludes the report.

Proposed Dynamics

Let \(\pi(q) \propto \exp{(-U(q))}\) be the target distribution of interest for parameters \(q \in \mathbb{R}^{d}\). As of 2020-06-16, the Stan GitHub branch develop contains a version of HMC that is written as the Hamiltonian

\[ H_{dv}(q, p) = U(q) + \frac{1}{2} p^T M^{-1} p, \]

where the subscript \(dv\) stands for develop and \(M^{-1} \approx \mathbb{E}_{\pi} (q - \mu)(q - \mu)^T\) is dense \(d \times d\) matrix. Under this scheme, the momenta follow a multivariate Normal distribution, \(p \sim N(0, M)\).

The work of Ma, Chen, and Fox (2015), in A Complete Recipe for Stochastic Gradient MCMC, suggests a richer framework than presented below1. In an effort to simplify Ma’s notation, some of their terms, that will drop out from this proposal eventually anyway, have been removed.

Let \(z = (q, p)\). Ma et al.’s framework dictates dynamics as

\[\mathrm{d}z = -Q(z) \nabla H(z)\]

where \(Q\) is any skew symmetric matrix. Some choices of \(Q\) will lead to faster convergence than others. This proposal maintains \(M^{-1}\) as a pre-conditioner, but incorporates it into \(Q\) instead of in the distribution of the momenta.

In the notation of Ma’s framework, branch develop’s version of HMC is carried out as follows. Draw \(p \sim N(0, M)\), necessitating a Cholesky decomposition \(LL^{T} = M^{-1}\) and a solve(). This was chosen by design because \(M^{-1}\) itself is needed in the next step, and it’s expensive to keep both \(L\) and \(M^{-1}\) in memory. With \(p\) in hand and

\[Q(z) = \begin{bmatrix} 0 && -I \\ I && 0 \end{bmatrix}\]

update \(z\) according to the dynamcis

\[ \mathrm{d}z = -Q(z) \nabla H(z) = -\begin{bmatrix} 0 && -I \\ I && 0 \end{bmatrix} \begin{bmatrix} \nabla U(q) \\ M^{-1}p \end{bmatrix} = \begin{bmatrix} M^{-1}p \\ -\nabla U(q) \end{bmatrix}.\]

Stan appproximates the dynamics of this Hamiltonian system in time \(t\) using the leapfrog integrator, which moves in steps of size \(\epsilon\).

\[\begin{align} p_{t + \epsilon / 2} & = p_{t} - \frac{\epsilon}{2}\nabla U(q_{t}) \\ q_{t + \epsilon} & = q_{t} + \epsilon M^{-1}p_{t + \epsilon / 2} \\ p_{t + \epsilon} & = p_{t + \epsilon / 2} - \frac{\epsilon}{2} \nabla U(q_{t + \epsilon}) \end{align}\]

Combining the first two steps into one and writing \(LL^T = M^{-1}\), we see that \(q\) is updated as

\[q_{t + \epsilon} = q_{t} + \epsilon LL^T p_{t} - \frac{\epsilon^2}{2} LL^T \nabla U(q_{t}) \]

where \(p \sim N(0, (LL^T)^{-1})\) implies that \(LL^Tp \sim N(0, LL^T)\).

Now consider the proposed version of HMC, where \(p \sim N(0, I)\) gives a slightly different Hamiltonian,

\[H_{pr}(z) = U(q) + \frac{1}{2}p^Tp.\]

Take

\[ Q(z) = \begin{bmatrix} 0 && -L^T \\ L && 0 \end{bmatrix} \]

where \(LL^T = M^{-1}\). Since \(M^{-1}\) is only updated at the end of each adaption window, \(L\) is fixed throughout sampling. There is no need for \(M^{-1}\) itself while sampling, not even to generate new momenta \(p\). Hence there is no solve() under this proposed scheme during the sampling phase of Stan’s sampler.

The rest of this write up ensures that the other pieces of the Hamiltonian Monte Carlo algorithm are updated apppropriately to the modified Hamiltonian just presented.

The proposed dynamics then follow

\[ \mathrm{d}z = \begin{bmatrix} L^{T}p \\ -L \nabla U(q) \end{bmatrix}.\]

Using the same leapfrog integrator, the one step update of \(q\) is

\[ q_{t + \epsilon} = q_{t} + \epsilon L p_{t} - \frac{\epsilon^2}{2}LL^T \nabla U(q_{t}). \]

Under this proposal, since \(p \sim N(0, I)\), we have that \(Lp \sim N(0, LL^T)\) just as in the one step update of \(q\) under branch develop.

The last necessary change is to measure distance appropriately under the proposed Hamiltonian \(H_{pr}\). In develop, and using the notation of Betancourt (2017),

\[ q_+(\mathfrak{t}) - q_-(\mathfrak{t}) = \int_{t = 0}^{t = T(\mathfrak{t})} \mathrm{d}t M^{-1} \cdot p(t). \]

Under the proposal, replace \(M^{-1}\) with \(I\), since \(p \sim N(0, I)\) instead of \(p \sim N(0, M)\). This ensures distance is measured with respect to the appropriate metric and the Hamiltonian \(H_{pr}\) is appropriately calculated.

Computational Savings Under Proposed Dynamics

Under the proposed dynamics, there is no decomposition nor solve() while sampling, which otherwise occurs once for each transition at a cost of roughly \(d^3 / 3\) and \(d^2\) operations respectively. However, since generating \(p\) and calculating \(Lp\) is as costly as a solve(), once the decomposition is already performed, there is no net gain from foregoing the solve() step. Theoretically, this proposal could take advantage of storing only \(L\), which is lower triangular. Further, with no decomposition, the proposed dynamics should also consume less memory. This report makes no attempt at measuring the effects on memory consumption due to the proposed dynamics. For each transition, the net savings in computational cost is one decomposition with approximate cost \(d^3/3\).

Since the leapfrog integrator takes two half-steps to update the momenta, \(L \nabla U(q)\) is calculated twice within one complete leapfrog step under the proposed dynamics, once for each half-step. This matrix-vector multiplication can be made more efficient in Eigen since \(L\) is lower triangular2.

Beyond the evaluation of \(\nabla U\), which is addressed below, the difference in computational cost for each leapfrog integration between proposal and develop comes down to a comparison between the cost of evaluating \(L \nabla U\) twice and \(M^{-1}p\) once. For the proposal, because \(L\) is lower triangular, there are \(d(d+1) / 2\) multiplications for each \(L \nabla U\) calculation, for a total of \(d^2 + d\) multiplications in each leapfrog step. For branch develop, since \(M^{-1}\) is dense, this is a dense matrix with \(d^2\) elements multiplied by a vector \(p\) of lenth \(d\), for a computational cost of \(d^2\) multiplications. Essentially, calculating \(L \nabla U\) twice entails traversing the diagonal of a \(d \times d\) matrix one extra time for each leapfrog step compared to branch develop’s leapfrog step. For each leapfrog step, the net savings of the proposed dynamics compared to branch develop is \(-d\).

Next, consider the metric under which distance is measured. By measuring distance with respect to an identity metric instead of with respect to \(M^{-1}\), the calculation \(M^{-1}p\), which happens number of leapfrog steps + 1 for each transitions, is replaced with \(p^Tp\) in the proposed dynamics. For each leapfrog step, there is a net savings of about \(d^2\) by measuring distance with \(p^Tp\) under the proposed dynamics.

Assuming \(K\) leapfrog steps happen on average in each transition, then the proposed Hamiltonian dynamics saves approximately \(d^3/3 + (K + 1) * (d^2 - d)\) operations.

Despite these computational savings, when \(U(q)\) and \(\nabla U(q)\) involve matrix-matrix products, which generally scale cubic in computational cost, or other expensive operations, these savings are moot. To put it a another way, these savings will increasingly help when the calculation costs of \(U(q)\) and \(\nabla U(q)\) don’t dominate run time. In the numerical experiments below, evaluating \(U(q)\) and \(\nabla U(q)\) does indeed dominate the overall run time. This report does not specifically seek out such target distributions to make this proposal appear better. Instead, a set of arguably interesting target distributions were chosen and evaluated in a numerical experiment below.

Numerical Experiment

We compare the two versions of Hamiltonian dynamics presented above. All models were fit with CmdStan v2.23.0 and CmdStanR v0.0.0.9000, on a Mac Pro running macOS 10.15.5, which has a 2.7GHz Intel Xeon E5 processor and 64GB 1866 MHz DDR3 memory.

Three Stan programs were tested. The first two are zero centered multivariate Normal and Student-T distributions with dimensions \(d \in \{2^i\}\) for \(i = 1:7\). For both the Normal and the Student-T and for each dimension \(d\), a matrix \(\Sigma\) was created by sampling eigenvalues from \(\text{Gamma}(\alpha = 0.5, \beta = 1)\) distribution. This strategy is designed to make for an ill-conditioned target by spreading widely the eigenvalues over about 5 orders of magnitude. The matrix \(\Sigma\) was generated for each dimension of both the Normal and Student-T targets and shared throughout the replications. The third target distribution considered is a sparse logist regression model fit to the German credit dataset (Hoffman et al. 2019).

Each model fitting was replicated 30 times for each set of dynamics, branch develop and the proposal. Both dynamics used the same seed for each replication. The default arguments for Stan were used throughout, save adapt_delta which was increased to \(0.95\) for only the German credit dataset target.

For each replicate, we compare the distribution of effective sample sizes (ESS). Following Vehtari et al. (2019), we calculate bulk and tail ESS, as well as bulk and tail ESS for the square of the parameter of interest. All ESS calculations are scaled by either the total run time to fit 4 chains or the total number of leapfrog steps taken.

Normal Distribution

For each dimension \(d \in \{2^i\}\) for \(i = 1:7\), define \(\pi(q)\) as a multivariate Normal distribution with mean vector \(\mu = \mathbf{0}\) and covariance \(\Sigma\) generated by sampling \(d\) eigenvalues from a \(\text{Gamma}(\alpha = 0.5, \beta = 1)\) distribution.

The plot below shows various effective sample size calculations per second of total run time across 7 different dimensions. Note that the y-axis is log base 10 transformed.

The plot below shows various effective sample size calculations per leapfrog evaluation across 7 different dimensions. Note that the y-axis is log base 10 transformed.