^{1,a)}, Raymond Z. Cui

^{2}, Gregory R. Bowman

^{3}, Daniel-Adriano Silva

^{2}, Jian Sun

^{4}and Xuhui Huang

^{2,a)}

### Abstract

Markov state models (MSMs) have become a popular approach for investigating the conformational dynamics of proteins and other biomolecules. MSMs are typically built from numerous molecular dynamics simulations by dividing the sampled configurations into a large number of microstates based on geometric criteria. The resulting microstate model can then be coarse-grained into a more understandable macrostate model by lumping together rapidly mixing microstates into larger, metastable aggregates. However, finite sampling often results in the creation of many poorly sampled microstates. During coarse-graining, these states are mistakenly identified as being kinetically important because transitions to/from them appear to be slow. In this paper, we propose a formalism based on an algebraic principle for matrix approximation, i.e., the Nyström method, to deal with such poorly sampled microstates. Our scheme builds a hierarchy of microstates from high to low populations and progressively applies spectral clustering on sets of microstates within each level of the hierarchy. It helps spectral clustering identify metastable aggregates with highly populated microstates rather than being distracted by lowly populated states. We demonstrate the ability of this algorithm to discover the major metastable states on two model systems, the alanine dipeptide and trpzip2 peptide.

We acknowledge the support from the National Basic Research Program of China (973 Program 2011CB809105, 2012CB825501 to Y.Y. and 2013CB834703 to X.H.), NSFC (61071157 to Y.Y. and 21273188 to X.H.), Microsoft Research Asia, a professorship in the Hundred Talents Program at Peking University to Y.Y., Hong Kong Research Grants Council GRF 661011, F-HK29/11T, HKUST2/CRF/10 to X.H., and University Grants Council SBI12SC01 to X.H. G.R.B. is supported by the Miller Institute.

I. INTRODUCTION

II. METHODS

A. Kinetic lumping with spectral methods

B. Nyström method

1. Application to symmetric matrix

2. Application to transition probability matrix

3. Recovery of original Markov chains

C. Multilevel analysis on free energy landscapes

D. Simulation details

III. RESULTS

A. Alanine-dipeptide

B. Trpzip2

IV. CONCLUSION AND DISCUSSION

### Key Topics

- Free energy
- 26.0
- Eigenvalues
- 24.0
- Conformational dynamics
- 19.0
- Cluster analysis
- 18.0
- Peptides
- 11.0

##### B01F3/00

## Figures

Schematic figure for the multilevel analysis of a free energy landscape. (a) a 1D free energy landscape divided into four levels; (b) a cluster tree representing this free energy landscape. At level one, two nodes are formed that correspond to the two deepest free energy minima. At level two, four nodes are identified for four free energy minima. At level three and four, the number of nodes are reduced since some free energy minima are connected.

Schematic figure for the multilevel analysis of a free energy landscape. (a) a 1D free energy landscape divided into four levels; (b) a cluster tree representing this free energy landscape. At level one, two nodes are formed that correspond to the two deepest free energy minima. At level two, four nodes are identified for four free energy minima. At level three and four, the number of nodes are reduced since some free energy minima are connected.

An illustration of Hierarchical Nyström Extension Graph (HNEG) applied to the alanine dipeptide system. (a) A conformation of the alanine dipeptide with two torsion angels (ϕ and ψ) labeled. (b) The free energy landscape projected onto the ϕ − ψ plane, where the red color indicates regions of high density or low free energy. (c) A Hierarchical Nyström Extension Graph containing 9 levels constructed for this system.

An illustration of Hierarchical Nyström Extension Graph (HNEG) applied to the alanine dipeptide system. (a) A conformation of the alanine dipeptide with two torsion angels (ϕ and ψ) labeled. (b) The free energy landscape projected onto the ϕ − ψ plane, where the red color indicates regions of high density or low free energy. (c) A Hierarchical Nyström Extension Graph containing 9 levels constructed for this system.

Bayesian comparison of MSMs constructed by the Hierarchical Nyström Extension Graph (HNEG) and the PCCA method for the alanine dipeptide. The y axis displays the logarithm of the posterior probability (ln (P(L 1∣D))) for models generated by HNEG (red) and PCCA (black). The logarithmic Bayes factor ln B = ln (P(L 1∣D)) HNEG − ln (P(L 2∣D) PCCA ) ≳ 100, indicating that HNEG consistently provides a better lumping than PCCA. The lag time is 9 ps.

Bayesian comparison of MSMs constructed by the Hierarchical Nyström Extension Graph (HNEG) and the PCCA method for the alanine dipeptide. The y axis displays the logarithm of the posterior probability (ln (P(L 1∣D))) for models generated by HNEG (red) and PCCA (black). The logarithmic Bayes factor ln B = ln (P(L 1∣D)) HNEG − ln (P(L 2∣D) PCCA ) ≳ 100, indicating that HNEG consistently provides a better lumping than PCCA. The lag time is 9 ps.

Bayesian comparison of MSMs constructed by the Hierarchical Nyström Extension Graph (HNEG) and the PCCA method for the trpzip2 peptide. The y axis displays the logarithm of the posterior probability (ln (P(L 1∣D))) for models generated by HNEG (red) and PCCA (black). The logarithmic Bayes factor ln B = ln (P(L 1∣D)) HNEG − ln (P(L 2∣D) PCCA ) ∈ [250, 500], indicating that HNEG consistently provides a better lumping than PCCA for the trpzip2 system. The lag time is 10 ns.

Bayesian comparison of MSMs constructed by the Hierarchical Nyström Extension Graph (HNEG) and the PCCA method for the trpzip2 peptide. The y axis displays the logarithm of the posterior probability (ln (P(L 1∣D))) for models generated by HNEG (red) and PCCA (black). The logarithmic Bayes factor ln B = ln (P(L 1∣D)) HNEG − ln (P(L 2∣D) PCCA ) ∈ [250, 500], indicating that HNEG consistently provides a better lumping than PCCA for the trpzip2 system. The lag time is 10 ns.

Representative structures of the 13 macrostates from the optimal lumping (with the highest posterior probability) for the trpzip2 system. Their equilibrium populations are also displayed. Macrostate 3 corresponds to a folded hairpin structure and has the largest population (38.5%), indicating that the trpzip2 peptide still has a significant fraction of the folded structure at 350 K.

Representative structures of the 13 macrostates from the optimal lumping (with the highest posterior probability) for the trpzip2 system. Their equilibrium populations are also displayed. Macrostate 3 corresponds to a folded hairpin structure and has the largest population (38.5%), indicating that the trpzip2 peptide still has a significant fraction of the folded structure at 350 K.

(a) An illustration of block diagonal structures of the microstate transition probability matrices. There are 2000 microstates in total, and the matrices are permuted to group microstates that belong to the same macrostate together. The results show that PCCA (left panel) tends to separate nearly disconnected small blocks first, while HNEG (right panel) focuses on identifying the well populated macrostates. The number of macrostates for both lumpings is 11. (b) HNEG successfully identify the large macrostates (around 9) with a much smaller total number of macrostates (<20) compared to PCCA (>80).

(a) An illustration of block diagonal structures of the microstate transition probability matrices. There are 2000 microstates in total, and the matrices are permuted to group microstates that belong to the same macrostate together. The results show that PCCA (left panel) tends to separate nearly disconnected small blocks first, while HNEG (right panel) focuses on identifying the well populated macrostates. The number of macrostates for both lumpings is 11. (b) HNEG successfully identify the large macrostates (around 9) with a much smaller total number of macrostates (<20) compared to PCCA (>80).

The Hierarchical Nyström method can robustly identify the large metastable macrostates with population greater than 1% (red), 2% (green), and 3% (blue), when varying the fraction of data that are included in the submatrix A (see Sec. II for details). The percentage of data we include(P m ) in the submatrix A is varied from 41% to 99%. The number of large macrostates keeps the same after we include 50% of the data or more.

The Hierarchical Nyström method can robustly identify the large metastable macrostates with population greater than 1% (red), 2% (green), and 3% (blue), when varying the fraction of data that are included in the submatrix A (see Sec. II for details). The percentage of data we include(P m ) in the submatrix A is varied from 41% to 99%. The number of large macrostates keeps the same after we include 50% of the data or more.

Histogram of pairwise mutual information (with a mean value of 2.104) between the optimal lumping and all other lumpings (259 of them) obtained from the Nyström method by varying the level sets (red). For comparison, the mutual information between a lumping generated by PCCA (with 13 macrostates) and the optimal lumping is only 0.785 (green). The entropy of the optimal lumping (upper limit of the mutual information) is 3.313 (black), while the averaged mutual information between random lumpings (we produced 259 random lumpings) and the optimal lumping is 0.051.

Histogram of pairwise mutual information (with a mean value of 2.104) between the optimal lumping and all other lumpings (259 of them) obtained from the Nyström method by varying the level sets (red). For comparison, the mutual information between a lumping generated by PCCA (with 13 macrostates) and the optimal lumping is only 0.785 (green). The entropy of the optimal lumping (upper limit of the mutual information) is 3.313 (black), while the averaged mutual information between random lumpings (we produced 259 random lumpings) and the optimal lumping is 0.051.

Illustration of the overlapping of microstates (or protein conformations) assigned to large macrostates in different lumpings. The optimal lumping (A) is compared with a representative lumping (B) with mutual information at around 2.1. Both of these lumpings contain 9 large macrostates states with population >2%. The joint probability matrix P A, B (i, j) indicates the overlapping of microstates assigned to macrostate i in the optimal lumping A and macrostate j in the representative lumping B. P A, B has large diagonal elements but small off-diagonal elements after permutation. These results indicate that the large macrostates in the two lumpings share a relatively large fraction of identical microstates.

Illustration of the overlapping of microstates (or protein conformations) assigned to large macrostates in different lumpings. The optimal lumping (A) is compared with a representative lumping (B) with mutual information at around 2.1. Both of these lumpings contain 9 large macrostates states with population >2%. The joint probability matrix P A, B (i, j) indicates the overlapping of microstates assigned to macrostate i in the optimal lumping A and macrostate j in the representative lumping B. P A, B has large diagonal elements but small off-diagonal elements after permutation. These results indicate that the large macrostates in the two lumpings share a relatively large fraction of identical microstates.

## Tables

Hierarchical Nyström Extension Graph (HNEG).

Hierarchical Nyström Extension Graph (HNEG).

Article metrics loading...

Full text loading...

Commenting has been disabled for this content