Abstract

This technical report analyzes two mathematically equivalent versions of Hamiltonian dynamics within the Hamiltonian Monte Carlo (HMC) algorithm as implemented in the probabilistic programming language Stan. This report compares Stan’s HMC dynamics to an alternative writing of the same dynamics, both under the dense Euclidean metric. While the proposed reframing of Hamiltonian dynamics reduces the number of calculations within the HMC algorithm, there is no tangible reduction in the run time of the overall sampler for the tested target distributions. This is likely due to the fact that gradient evaluations of the target distributions dominate the run time.

Introduction

This technical report analyzes two mathematically equivalent versions of Hamiltonian dynamics within the Hamiltonian Monte Carlo (HMC) algorithm as implemented in the probabilistic programming language Stan. This report compares Stan’s HMC dynamics to an alternative writing of the same dynamics, both under the dense Euclidean metric, but does not attempt to explain Hamiltonian Monte Carlo. For a thorough introduction to HMC, consider Michael Betancourt’s A Conceptual Introduction to Hamiltonian Monte Carlo and Radford Neal’s MCMC using Hamiltonian dynamics. While the proposed dynamics reduce the number of calculations within the HMC algorithm, there is no tangible reduction in the run time of the overall sampler for the tested target distributions. This is likely due to the fact that the majority of the HMC run time is dominated by the cost of leapfrog steps, which involve costly gradient evaluations of the target distribution.

This report first describes the dynamics as currently implemented in Stan v2.23.0, then describes the proposed dynamics. A simulation study is carried out in the section Numerical Experiments, and a short discussion of the results and of potential future research concludes the report.

Proposed Dynamics

Let \(\pi(q) \propto \exp{(-U(q))}\) be the target distribution of interest for parameters \(q \in \mathbb{R}^{d}\). As of 2020-06-16, the Stan GitHub branch develop contains a version of HMC that is written as the Hamiltonian

\[ H_{dv}(q, p) = U(q) + \frac{1}{2} p^T M^{-1} p, \]

where the subscript \(dv\) stands for develop and \(M^{-1} \approx \mathbb{E}_{\pi} (q - \mu)(q - \mu)^T\) is dense \(d \times d\) matrix. Under this scheme, the momenta follow a multivariate Normal distribution, \(p \sim N(0, M)\).

The work of Ma, Chen, and Fox (2015), in A Complete Recipe for Stochastic Gradient MCMC, suggests a richer framework than presented below1. In an effort to simplify Ma’s notation, some of their terms, that will drop out from this proposal eventually anyway, have been removed.

Let \(z = (q, p)\). Ma et al.’s framework dictates dynamics as

\[\mathrm{d}z = -Q(z) \nabla H(z)\]

where \(Q\) is any skew symmetric matrix. Some choices of \(Q\) will lead to faster convergence than others. This proposal maintains \(M^{-1}\) as a pre-conditioner, but incorporates it into \(Q\) instead of in the distribution of the momenta.

In the notation of Ma’s framework, branch develop’s version of HMC is carried out as follows. Draw \(p \sim N(0, M)\), necessitating a Cholesky decomposition \(LL^{T} = M^{-1}\) and a solve(). This was chosen by design because \(M^{-1}\) itself is needed in the next step, and it’s expensive to keep both \(L\) and \(M^{-1}\) in memory. With \(p\) in hand and

\[Q(z) = \begin{bmatrix} 0 && -I \\ I && 0 \end{bmatrix}\]

update \(z\) according to the dynamcis

\[ \mathrm{d}z = -Q(z) \nabla H(z) = -\begin{bmatrix} 0 && -I \\ I && 0 \end{bmatrix} \begin{bmatrix} \nabla U(q) \\ M^{-1}p \end{bmatrix} = \begin{bmatrix} M^{-1}p \\ -\nabla U(q) \end{bmatrix}.\]

Stan appproximates the dynamics of this Hamiltonian system in time \(t\) using the leapfrog integrator, which moves in steps of size \(\epsilon\).

\[\begin{align} p_{t + \epsilon / 2} & = p_{t} - \frac{\epsilon}{2}\nabla U(q_{t}) \\ q_{t + \epsilon} & = q_{t} + \epsilon M^{-1}p_{t + \epsilon / 2} \\ p_{t + \epsilon} & = p_{t + \epsilon / 2} - \frac{\epsilon}{2} \nabla U(q_{t + \epsilon}) \end{align}\]

Combining the first two steps into one and writing \(LL^T = M^{-1}\), we see that \(q\) is updated as

\[q_{t + \epsilon} = q_{t} + \epsilon LL^T p_{t} - \frac{\epsilon^2}{2} LL^T \nabla U(q_{t}) \]

where \(p \sim N(0, (LL^T)^{-1})\) implies that \(LL^Tp \sim N(0, LL^T)\).

Now consider the proposed version of HMC, where \(p \sim N(0, I)\) gives a slightly different Hamiltonian,

\[H_{pr}(z) = U(q) + \frac{1}{2}p^Tp.\]

Take

\[ Q(z) = \begin{bmatrix} 0 && -L^T \\ L && 0 \end{bmatrix} \]

where \(LL^T = M^{-1}\). Since \(M^{-1}\) is only updated at the end of each adaption window, \(L\) is fixed throughout sampling. There is no need for \(M^{-1}\) itself while sampling, not even to generate new momenta \(p\). Hence there is no solve() under this proposed scheme during the sampling phase of Stan’s sampler.

The rest of this write up ensures that the other pieces of the Hamiltonian Monte Carlo algorithm are updated apppropriately to the modified Hamiltonian just presented.

The proposed dynamics then follow

\[ \mathrm{d}z = \begin{bmatrix} L^{T}p \\ -L \nabla U(q) \end{bmatrix}.\]

Using the same leapfrog integrator, the one step update of \(q\) is

\[ q_{t + \epsilon} = q_{t} + \epsilon L p_{t} - \frac{\epsilon^2}{2}LL^T \nabla U(q_{t}). \]

Under this proposal, since \(p \sim N(0, I)\), we have that \(Lp \sim N(0, LL^T)\) just as in the one step update of \(q\) under branch develop.

The last necessary change is to measure distance appropriately under the proposed Hamiltonian \(H_{pr}\). In develop, and using the notation of Betancourt (2017),

\[ q_+(\mathfrak{t}) - q_-(\mathfrak{t}) = \int_{t = 0}^{t = T(\mathfrak{t})} \mathrm{d}t M^{-1} \cdot p(t). \]

Under the proposal, replace \(M^{-1}\) with \(I\), since \(p \sim N(0, I)\) instead of \(p \sim N(0, M)\). This ensures distance is measured with respect to the appropriate metric and the Hamiltonian \(H_{pr}\) is appropriately calculated.

Computational Savings Under Proposed Dynamics

Under the proposed dynamics, there is no decomposition nor solve() while sampling, which otherwise occurs once for each transition at a cost of roughly \(d^3 / 3\) and \(d^2\) operations respectively. However, since generating \(p\) and calculating \(Lp\) is as costly as a solve(), once the decomposition is already performed, there is no net gain from foregoing the solve() step. Theoretically, this proposal could take advantage of storing only \(L\), which is lower triangular. Further, with no decomposition, the proposed dynamics should also consume less memory. This report makes no attempt at measuring the effects on memory consumption due to the proposed dynamics. For each transition, the net savings in computational cost is one decomposition with approximate cost \(d^3/3\).

Since the leapfrog integrator takes two half-steps to update the momenta, \(L \nabla U(q)\) is calculated twice within one complete leapfrog step under the proposed dynamics, once for each half-step. This matrix-vector multiplication can be made more efficient in Eigen since \(L\) is lower triangular2.

Beyond the evaluation of \(\nabla U\), which is addressed below, the difference in computational cost for each leapfrog integration between proposal and develop comes down to a comparison between the cost of evaluating \(L \nabla U\) twice and \(M^{-1}p\) once. For the proposal, because \(L\) is lower triangular, there are \(d(d+1) / 2\) multiplications for each \(L \nabla U\) calculation, for a total of \(d^2 + d\) multiplications in each leapfrog step. For branch develop, since \(M^{-1}\) is dense, this is a dense matrix with \(d^2\) elements multiplied by a vector \(p\) of lenth \(d\), for a computational cost of \(d^2\) multiplications. Essentially, calculating \(L \nabla U\) twice entails traversing the diagonal of a \(d \times d\) matrix one extra time for each leapfrog step compared to branch develop’s leapfrog step. For each leapfrog step, the net savings of the proposed dynamics compared to branch develop is \(-d\).

Next, consider the metric under which distance is measured. By measuring distance with respect to an identity metric instead of with respect to \(M^{-1}\), the calculation \(M^{-1}p\), which happens number of leapfrog steps + 1 for each transitions, is replaced with \(p^Tp\) in the proposed dynamics. For each leapfrog step, there is a net savings of about \(d^2\) by measuring distance with \(p^Tp\) under the proposed dynamics.

Assuming \(K\) leapfrog steps happen on average in each transition, then the proposed Hamiltonian dynamics saves approximately \(d^3/3 + (K + 1) * (d^2 - d)\) operations.

Despite these computational savings, when \(U(q)\) and \(\nabla U(q)\) involve matrix-matrix products, which generally scale cubic in computational cost, or other expensive operations, these savings are moot. To put it a another way, these savings will increasingly help when the calculation costs of \(U(q)\) and \(\nabla U(q)\) don’t dominate run time. In the numerical experiments below, evaluating \(U(q)\) and \(\nabla U(q)\) does indeed dominate the overall run time. This report does not specifically seek out such target distributions to make this proposal appear better. Instead, a set of arguably interesting target distributions were chosen and evaluated in a numerical experiment below.

Numerical Experiment

We compare the two versions of Hamiltonian dynamics presented above. All models were fit with CmdStan v2.23.0 and CmdStanR v0.0.0.9000, on a Mac Pro running macOS 10.15.5, which has a 2.7GHz Intel Xeon E5 processor and 64GB 1866 MHz DDR3 memory.

Three Stan programs were tested. The first two are zero centered multivariate Normal and Student-T distributions with dimensions \(d \in \{2^i\}\) for \(i = 1:7\). For both the Normal and the Student-T and for each dimension \(d\), a matrix \(\Sigma\) was created by sampling eigenvalues from \(\text{Gamma}(\alpha = 0.5, \beta = 1)\) distribution. This strategy is designed to make for an ill-conditioned target by spreading widely the eigenvalues over about 5 orders of magnitude. The matrix \(\Sigma\) was generated for each dimension of both the Normal and Student-T targets and shared throughout the replications. The third target distribution considered is a sparse logist regression model fit to the German credit dataset (Hoffman et al. 2019).

Each model fitting was replicated 30 times for each set of dynamics, branch develop and the proposal. Both dynamics used the same seed for each replication. The default arguments for Stan were used throughout, save adapt_delta which was increased to \(0.95\) for only the German credit dataset target.

For each replicate, we compare the distribution of effective sample sizes (ESS). Following Vehtari et al. (2019), we calculate bulk and tail ESS, as well as bulk and tail ESS for the square of the parameter of interest. All ESS calculations are scaled by either the total run time to fit 4 chains or the total number of leapfrog steps taken.

Normal Distribution

For each dimension \(d \in \{2^i\}\) for \(i = 1:7\), define \(\pi(q)\) as a multivariate Normal distribution with mean vector \(\mu = \mathbf{0}\) and covariance \(\Sigma\) generated by sampling \(d\) eigenvalues from a \(\text{Gamma}(\alpha = 0.5, \beta = 1)\) distribution.

The plot below shows various effective sample size calculations per second of total run time across 7 different dimensions. Note that the y-axis is log base 10 transformed.

The plot below shows various effective sample size calculations per leapfrog evaluation across 7 different dimensions. Note that the y-axis is log base 10 transformed.

These plots make it apparent that despite reducing the number of calculations during sampling, the proposed dynamics have not significantly altered the overall effectiveness of Stan’s HMC algorithm for Normal target distributions.

Stan Model for Gaussian Targets

## data {
##   int<lower=0> d;
##   matrix[d, d] S;
## }
## 
## transformed data {
##   vector[d] mu = rep_vector(0.0, d);
## }
## 
## parameters {
##   vector[d] x;
## }
## 
## model {
##   x ~ multi_normal(mu, S);
## }

Generate \(\Sigma\)

The following code generates \(\Sigma\) dependent on the dimension \(d\).

gen_sigma <- function(d) {
    eig <- sort(rgamma(d, 0.5, 1))
    QR <- qr(matrix(rnorm(d * d), nrow = d))
    Q <- qr.Q(QR)
    Q %*% diag(1 / eig) %*% t(Q)
}

Student-T Distribution

For each dimension \(d \in \{2^i\}\) for \(i = 1:7\), define \(\pi(q)\) as a multivariate Student-T distribution with mean vector \(\mu = \mathbf{0}\) and matrix \(\Sigma\) generated by sampling \(d\) eigenvalues from a \(\text{Gamma}(\alpha = 0.5, \beta = 1)\) distribution. Degrees of freedom are set to three, \(\nu = 3\), throughout all simulations.

The plot below shows various effective sample size calculations per second of total run time across 7 different dimensions. Note that the y-axis is log base 10 transformed.

The plot below shows various effective sample size calculations per leapfrog evaluation across 7 different dimensions. Note that the y-axis is log base 10 transformed.

These plots make it apparent that despite reducing the number of calculations during sampling, the proposed dynamics have not significantly altered the overall effectiveness of Stan’s HMC algorithm for Student-T target distributions.

Stan Model for Student-T Targets

## data {
##   int<lower=0> d;
##   real<lower=0> nu;
##   matrix[d, d] S;
## }
## 
## transformed data {
##   vector[d] mu = rep_vector(0.0, d);
##   real<lower=0.> nnu = 0.5 * nu;
## }
## 
## parameters {
##   vector[d] mvnormal;
##   real<lower=0> invgamma;
## }
## 
## transformed parameters {
##   vector[d] x = mvnormal * sqrt(invgamma);;
## }
## 
## model {
##   mvnormal ~ multi_normal(mu, S);
##   invgamma ~ inv_gamma(nnu, nnu);
## }

Generate \(\Sigma\)

The following code generates \(\Sigma\) dependent on the dimension \(d\).

gen_sigma <- function(d) {
    eig <- sort(rgamma(d, 0.5, 1))
    QR <- qr(matrix(rnorm(d * d), nrow = d))
    Q <- qr.Q(QR)
    Q %*% diag(1 / eig) %*% t(Q)
}

German Credit Dataset

The plots below show ESS per second and ESS per leapfrog, respectively. Note that the y-axis is log base 10 transformed. Only values for the coefficient vector \(\beta\) are shown. There is no obvious sign that the proposed dynamics have significantly altered the overall effectiveness of Stan’s HMC algorithm for this sparse logistic regression target distribution.

Stan Model for German Credit Dataset Target

## // Adapted from
## // https://github.com/google-research/google-research/blob/master/neutra/logistic_reg_pystan.py
## //
## // Copyright 2020 The Google Research Authors.
## //
## // Licensed under the Apache License, Version 2.0 (the "License");
## // you may not use this file except in compliance with the License.
## // You may obtain a copy of the License at
## //
## //     http://www.apache.org/licenses/LICENSE-2.0
## //
## // Unless required by applicable law or agreed to in writing, software
## // distributed under the License is distributed on an "AS IS" BASIS,
## // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
## // See the License for the specific language governing permissions and
## // limitations under the License.
## 
## data {
##   int<lower=0> n;               // number of observations
##   int<lower=0> d;               // number of predictors
##   int<lower=0,upper=1> y[n];    // outputs
##   matrix[n,d] x;                // inputs
## }
## parameters {
##   vector[d] z;
##   vector<lower=0>[d] local_scale;
##   real<lower=0> global_scale;
## }
## transformed parameters {
##   vector[d] beta;               // regression coefficients
##   vector[n] f;                  // latent values
## 
##   beta = z .* local_scale * global_scale;
##   f = x * beta;
## }
## model {
##   z ~ normal(0, 1);
##   local_scale ~ gamma(0.5, 0.5);
##   global_scale ~ gamma(0.5, 0.5);
## 
##   y ~ bernoulli_logit(f);
## }

Discussion

This technical report reduces the number of necessary calculations performed in the Hamiltonian dynamics of Stan’s HMC algorithm in an attempt to minimize the computational cost, decrease the run time of the algorithm, and therefore increase the overall efficiency of sampling from a target distribution. Despite the computational savings of the proposed dynamics, there is no obvious sign of any change in the overall efficiency of the HMC algorithm in Stan v2.23.0, as measured by effective sample size per second of total run time.

While completing this report, I discovered the paper Magnetic Hamiltonian Monte Carlo (Tripuraneni et al. 2017). The dynamics proposed therein cover and extend the dynamics propsed here. In the extension, Tripuraneni et al further parameterize the skew symmetric matrix \(Q\), which they believe has the potential to improve pre-conditioning of the MCMC kernel of HMC, if only the extra parameters could be properly tuned. It could be interesting to evaluate the general training procedure used in Levy, Hoffman, and Sohl-Dickstein (2017), which was adapted from Pasarica and Gelman (2010), to tune these extra parameters in \(Q\).

License

Copyright (c) 2020, Edward A. Roualdes CC-BY

References

Betancourt, Michael. 2017. “A Conceptual Introduction to Hamiltonian Monte Carlo.” arXiv Preprint arXiv:1701.02434.

Hoffman, Matthew, Pavel Sountsov, Joshua V Dillon, Ian Langmore, Dustin Tran, and Srinivas Vasudevan. 2019. “NeuTra-Lizing Bad Geometry in Hamiltonian Monte Carlo Using Neural Transport.” arXiv Preprint arXiv:1903.03704.

Leimkuhler, Benedict, Charles Matthews, and Jonathan Weare. 2018. “Ensemble Preconditioning for Markov Chain Monte Carlo Simulation.” Statistics and Computing 28 (2). Springer: 277–90.

Levy, Daniel, Matthew D Hoffman, and Jascha Sohl-Dickstein. 2017. “Generalizing Hamiltonian Monte Carlo with Neural Networks.” arXiv Preprint arXiv:1711.09268.

Ma, Yi-An, Tianqi Chen, and Emily Fox. 2015. “A Complete Recipe for Stochastic Gradient Mcmc.” In Advances in Neural Information Processing Systems, 2917–25.

Neal, Radford M. 2012. “MCMC Using Hamiltonian Dynamics. Published as Chapter 5 of the Handbook of Markov Chain Monte Carlo, 2011.” arXiv Preprint arXiv:1206.1901.

Pasarica, Cristian, and Andrew Gelman. 2010. “Adaptively Scaling the Metropolis Algorithm Using Expected Squared Jumped Distance.” Statistica Sinica. JSTOR, 343–64.

Team, Stan Development. 2018. “The Stan Core Library.” http://mc-stan.org.

Tripuraneni, Nilesh, Mark Rowland, Zoubin Ghahramani, and Richard Turner. 2017. “Magnetic Hamiltonian Monte Carlo.” In Proceedings of the 34th International Conference on Machine Learning-Volume 70, 3453–61. JMLR. org.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2019. “Rank-Normalization, Folding, and Localization: An Improved ˆ R for Assessing Convergence of Mcmc.” arXiv Preprint arXiv:1903.08008 18.


  1. They also offer a stochastic version of their framework, but the stochastics is of no interest here.

  2. Thanks to Matija Rezar and Ben Bales for pointing this out.