^{1}, B. K. Sin^{1}, A. R. Mohazab^{1,a)} and S. S. Plotkin^{1,b)}

### Abstract

Computer simulations can provide critical information on the unfolded ensemble of proteins under physiological conditions, by explicitly characterizing the geometrical properties of the diverse conformations that are sampled in the unfolded state. However, a general computational analysis across many proteins has not been implemented. Here, we develop a method for generating a diverse conformational ensemble, to characterize properties of the unfolded states of intrinsically disordered or intrinsically folded proteins. The method allows unfolded proteins to retain disulfide bonds. We examined physical properties of the unfolded ensembles of several proteins, including chemical shifts, clustering properties, and scaling exponents for the radius of gyration with polymer length. A problem relating simulated and experimental residual dipolar couplings is discussed. We apply our generated ensembles to the problem of folding kinetics, by examining whether the ensembles of some proteins are closer geometrically to their folded structures than others. We find that for a randomly selected dataset of 15 non-homologous 2- and 3-state proteins, quantities such as the average root mean squared deviation between the folded structure and unfolded ensemble correlate with folding rates as strongly as absolute contact order. We introduce a new order parameter that measures the distance travelled per residue, which naturally partitions into a smooth “laminar” and subsequent “turbulent” part of the trajectory. This latter, conceptually simple measure, which has no fitting parameters, predicts folding rates in 0 M denaturant with remarkable accuracy (r = −0.95, p = 1 × 10^{−7}). The high correlation between folding times and sterically modulated, reconfigurational motion supports the rapid collapse of proteins prior to the transition state as a generic feature in the folding of both two-state and multi-state proteins.
This method for generating unfolded ensembles provides a powerful approach to address various questions in protein evolution, misfolding and aggregation, transient structures, and molten globule and disordered protein phases.

S.S.P. acknowledges funding support from PrioNet Canada and NSERC, and computational support from the WestGrid high-performance computing consortium.

I. INTRODUCTION

II. METHODS

A. Generating diverse ensembles of unfolded configurations

1. Pivot and crankshaft moves

2. Foliation, minimization, and equilibration

B. Chemical shifts and residual dipolar couplings

C. Investigating the correlation between geometrical folding pathways and folding kinetics

1. Order parameters for unfolded structures

III. RESULTS AND DISCUSSION

A. Chemical shifts and residual dipolar couplings

B. Polymer scaling laws and persistence length

C. Clustering analysis

D. Correlations between geometrical folding pathways and folding kinetics

IV. CONCLUSIONS

### Key Topics

- Proteins (83.0)
- Protein folding (36.0)
- Polymers (15.0)
- Chemical shifts (11.0)
- Conformational dynamics (11.0)

## Figures

Overview of algorithm.

Illustration of an example pivot move for PDB 1L2Y.

Crankshaft move for SOD1 (PDB 1HL5), a protein with a long-range disulfide bond between C57 and C146. A minimized, non-equilibrated configuration is shown.

Schematic figures indicating the processes of backbone and side-chain addition, energy minimization, and 1 ns thermal equilibration.

Example optimal folding trajectories for 5 Cα atoms in apo-myoglobin (1A6N). Unfolded and folded structures are also shown.

Each Cα trajectory is divided into a smooth “laminar” and rugged “turbulent” part. Panels (a) and (b) show sample trajectories for Cα(4) and Cα(75) of apo-myoglobin. Panel (a) is predominantly laminar; the corresponding distances are Å, Å. Panel (b) is predominantly turbulent; the corresponding distances are Å, Å. (c) Criterion for determining the transition from laminar to turbulent trajectories. When the root variance in the distance travelled per step jumps above a threshold given by 7 times the baseline value, the trajectory from then on is defined as turbulent.
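
The threshold criterion in panel (c) can be illustrated with a minimal Python sketch. The caption specifies only the factor of 7 relative to a baseline; the sliding-window standard deviation, the window size of 5 steps, the choice of the first window as the baseline, and the function names are all assumptions made here for illustration, not the authors' implementation.

```python
import math
import statistics


def step_lengths(points):
    """Euclidean distance between consecutive 3D points of a trajectory."""
    return [math.dist(p, q) for p, q in zip(points, points[1:])]


def laminar_turbulent_split(points, window=5, factor=7.0):
    """Return (laminar_distance, turbulent_distance, split_step).

    Assumptions (not from the source): the root variance is a sliding-window
    population standard deviation of step lengths, and the baseline is its
    value in the first window. split_step is None if no jump is found.
    """
    steps = step_lengths(points)
    if len(steps) < window:
        return sum(steps), 0.0, None
    # Guard against a perfectly uniform start, where the baseline is zero.
    baseline = max(statistics.pstdev(steps[:window]), 1e-12)
    for start in range(1, len(steps) - window + 1):
        if statistics.pstdev(steps[start:start + window]) > factor * baseline:
            # The newest step in the window is the one that caused the jump.
            split = start + window - 1
            return sum(steps[:split]), sum(steps[split:]), split
    return sum(steps), 0.0, None
```

Everything before the split step is summed as the laminar distance and everything from it onward as the turbulent distance, so the two always add up to the total path length of the trajectory.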

Different ensembles considered in this study to compare with protein folding kinetics.

(Panel (a)) TM-score distributions between native structures, showing homology of our dataset compared to an NR dataset^{56} and other datasets used for protein folding kinetics analysis.^{84,118} One can see some homologous protein pairs in other datasets. (Panel (b)) TM-score distribution between 1299 unfolded states for α-synuclein. Similar distributions are obtained for other proteins.

Comparison between experimental and simulated ^{13}Cα chemical shift values for Aβ1–42. (Main panel) Black data points are experimental values from Ref. 57; red data points are those from the simulated ensemble of 773 conformations, using CAMSHIFT. (Inset (a)) Scatter plot of experimental vs simulated chemical shifts (r = 0.93). (Panel (b)) Convergence study of the correlation coefficient between experimental and simulated data. The mean correlation coefficient is shown; vertical bars indicate the standard deviation of correlation coefficient values when random subsets with a given number of frames are taken from the total dataset.
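
The convergence study in panel (b) can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the authors' code: the per-frame shifts are taken as already computed (for example by CAMSHIFT), and the function names, the number of trials, and the seeding are hypothetical choices.

```python
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def subset_convergence(frames, experimental, n_frames, n_trials=200, seed=0):
    """Mean and standard deviation of r over random subsets of frames.

    frames: list of per-frame shift lists (one value per residue).
    experimental: experimental shift per residue, in the same residue order.
    """
    rng = random.Random(seed)
    rs = []
    for _ in range(n_trials):
        subset = rng.sample(frames, n_frames)  # sampled without replacement
        # Ensemble-average the simulated shift of each residue over the subset.
        avg = [statistics.fmean(col) for col in zip(*subset)]
        rs.append(pearson_r(avg, experimental))
    return statistics.fmean(rs), statistics.pstdev(rs)
```

Calling `subset_convergence` for increasing `n_frames` gives the mean and spread of r as a function of subset size; the spread shrinks to zero when the subset is the whole ensemble.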

(a) Radius of gyration vs. time (equilibration process), for proTα: a highly charged, intrinsically disordered protein. The relaxation time is about 0.8 ns, and the asymptotic value of the radius of gyration R_G is about 35.5 Å. (b) Scaling of the radius of gyration R_G with chain length, obtained by taking all subsections of a given length and finding the ensemble averaged radius of gyration. (Inset) Extrapolation procedure to find the asymptotic value of the scaling exponent ν. The value of ν is obtained for ensembles at a given equilibration time. This value converges exponentially to the t → ∞ value. Extrapolation from ensembles with t ⩽ 1 ns gives an asymptotic value of 0.633, while extrapolation from ensembles with t ⩽ 5 ns gives an asymptotic value of 0.631. A similar conclusion was obtained from extrapolation of the data for α-syn. Thus, extrapolation of ν from t ⩽ 1 ns ensembles is likely to be sufficiently accurate in general.
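
The subsection-averaging procedure of panel (b) can be sketched in Python as below. The sketch assumes each conformation is a list of (x, y, z) coordinates, one per residue, with equal weights, and that ν is estimated as the least-squares slope of log R_G against log N; the function names are hypothetical, and the exponential extrapolation in equilibration time shown in the inset is not covered.

```python
import math


def radius_of_gyration(coords):
    """Root mean squared distance of equally weighted points from their centroid."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n)


def mean_rg_by_length(ensemble, lengths):
    """Average R_G over every contiguous subsection of each length, across the ensemble."""
    result = {}
    for n in lengths:
        vals = [radius_of_gyration(conf[i:i + n])
                for conf in ensemble
                for i in range(len(conf) - n + 1)]
        result[n] = sum(vals) / len(vals)
    return result


def scaling_exponent(rg_by_length):
    """Least-squares slope of log(R_G) against log(N), i.e. nu in R_G ~ N**nu."""
    xs = [math.log(n) for n in rg_by_length]
    ys = [math.log(r) for r in rg_by_length.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
```

As a sanity check, a straight rigid rod should give an exponent close to 1, since its radius of gyration grows linearly with the number of residues.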

Nearest neighbor clustering using TM-score of 1299 structures of α-synuclein, projected onto the TM-scores to the centroid structures of the largest three clusters (blue, red, and black, respectively). Representative conformations in each cluster are shown. The lack of distinct clustering indicates diverse sampling of the unfolded ensemble.

(a) Scatter plot of the absolute contact order (ACO) and average laminar distance (equilibrium ensemble, with smoothed trajectories), for the 15 natively folded proteins in Table I. 2-state proteins (blue squares) and 3-state proteins (red triangles) are well-clustered by the average laminar distance, but not by ACO, as can be seen by inspection, i.e., by projecting data onto each order parameter. Closed curves circumscribing each class of protein are a guide to the eye. (b) Statistical significance (p-values) that the various metrics for 2-state and 3-state folders arise from different distributions, as determined by t-test.^{30} −log(p) is plotted, so that a higher number indicates better ability to distinguish between the two classes. The dashed black horizontal line indicates a threshold of 5% for statistical significance. Only ACO and maxcluster-determined TM-score fail to distinguish 2-state from 3-state folders. Error bars for ACO and the average laminar distance are obtained by removing 1 data point at random from the dataset, recomputing −log(p), and then calculating the standard deviation for the resulting collection of values. Notation used in this panel is further described in Figure 14.

Optimal folding trajectory of Cα(50) in apo-myoglobin (1A6N). The trajectory is curved, due to steric constraints with the remainder of the protein. Cα(50) is shown as blue spheres in the initial and final states. The region of protein N-terminal to Cα(50) in the initial unfolded state is shown in red. This transforms to the short helix N-terminal to Cα(50) in the final position.

Correlation matrix for all geometrical parameters, as well as experimental folding rates. The upper triangular elements are Pearson correlation coefficients. The lower triangular elements are the corresponding statistical significance values, which are represented as −log_{10}(p) so that, e.g., 4.5 corresponds to p = 10^{−4.5} = 3.2 × 10^{−5}. Red represents strong positive correlation; blue represents strong negative correlation. “_raw” indicates numbers taken from the raw trajectory, while “_smooth” indicates numbers taken from the smoothed trajectory. Trajectories are further divided into “_laminar” and “_turbulent” parts. Initial ensembles are either equilibrated (“_equil”) or pre-equilibration (energy minimized only, “_min”). Other parameters shown include ACO, protein length, GDT-TS, TM-score, natural log of the folding and unfolding rates in 0 M denaturant, and natural log of relaxation rate at the transition midpoint.

(Panel (a)) Correlation of various distance metrics with experimental refolding rate in water, for the dataset of proteins listed in Table I. Raw (rather than smoothed) data are taken here. Minus the log base 10 of the statistical significance is plotted, and the horizontal dashed line gives the threshold of statistical significance (p = 0.05). The best predictor of folding rates in water, the turbulent distance, has a significance of 10^{−7}. Each integer below this value in the plot corresponds to a decrease in significance by an order of magnitude. (Panel (b)) Same as panel (a) but for experimental unfolding rate in water. Here, ACO emerges as the strongest correlator of unfolding rate. (Panel (c)) Same as panel (a) but for relaxation rate at the transition midpoint. Here, several variants of the distance travelled correlate best with relaxation rate, e.g., two such variants each have a correlation coefficient r = −0.84.

(Panel (a)) Scatter plot of experimental folding rate at 0 M denaturant with the unfolded ensemble-averaged turbulent distance travelled during folding, corresponding to late-stage protein reconfiguration of structured elements. (Panel (b)) Scatter plot of the folding rate at 0 M denaturant with the ensemble-averaged RMSD between unfolded structures and the native. For both plots, the pre-equilibrated, energy-minimized ensemble is taken, and raw rather than smoothed data are used. Data for 2-state proteins are shown as squares; data for 3-state proteins are shown as triangles.

## Tables
