msmbuilder.msm.BayesianMarkovStateModel

class msmbuilder.msm.BayesianMarkovStateModel(lag_time=1, n_samples=100, n_steps=0, n_chains=None, n_timescales=None, reversible=True, ergodic_cutoff='on', prior_counts=0, sliding_window=True, random_state=None, sampler='metzner', verbose=False)

Bayesian reversible Markov state model.

Variant of MarkovStateModel that estimates a distribution over transition matrices, rather than a single transition matrix, using Metropolis Markov chain Monte Carlo. This distribution describes the statistical uncertainty in the transition matrix (and in functions of the transition matrix) and is stored in all_transmats_.

Parameters:
  • lag_time (int) – The lag time of the model
  • n_samples (int, default=100) – Total number of transition matrices to sample from the posterior
  • n_steps (int, default=n_states) – Number of MCMC steps to take between sampled transition matrices. By default, we use n_steps=n_states_.
  • n_chains (int, default=n_procs) – Number of independent Markov chains to simulate. The requested number of transition matrix samples will be generated from n_chains independent MCMC chains.
  • n_timescales (int, optional) – The number of dynamical timescales to calculate when diagonalizing the transition matrix.
  • reversible (bool, default=True) – Enforce reversibility during transition matrix sampling
  • ergodic_cutoff (int, default=1) – Only the maximal strongly ergodic subgraph of the data is used to build an MSM. Ergodicity is determined by ensuring that each state is accessible from each other state via one or more paths involving edges with a number of observed directed counts greater than or equal to ergodic_cutoff. Note that setting ergodic_cutoff to 0 effectively turns this trimming off.
  • prior_counts (float, optional) – Add a number of “pseudo counts” to each entry in the counts matrix. When prior_counts == 0 (default), the assigned transition probability between two states with no observed transitions will be zero, whereas when prior_counts > 0, even these unobserved transitions will be given nonzero probability.
  • sliding_window (bool, optional) – Count transitions using a window of length lag_time, which is slid along the sequences 1 unit at a time, yielding transitions which contain more data but cannot be assumed to be statistically independent. Otherwise, the sequences are simply subsampled at an interval of lag_time.
  • random_state (int or RandomState instance or None (default)) – Pseudo Random Number generator seed control. If None, use the numpy.random singleton.
  • sampler ({'metzner', 'metzner_py'}) – The sampler implementation to use. ‘metzner’ is the sampler from Ref. [1] implemented in C; ‘metzner_py’ is a pure-Python reference implementation.
  • verbose (bool) – Enable verbose printout
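
The trimming controlled by ergodic_cutoff can be illustrated with a small pure-Python sketch. This is not msmbuilder's implementation: the helper names are hypothetical, and choosing the strongly connected component with the most states is an assumption made for the illustration.

```python
def reachable(adjacency, start):
    """Return the set of states reachable from `start` (including itself)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def largest_ergodic_component(counts, ergodic_cutoff=1):
    """Hypothetical helper: states in the largest strongly connected set.

    An edge i -> j is kept only if counts[i][j] >= ergodic_cutoff.
    """
    n = len(counts)
    adjacency = [
        [j for j in range(n) if counts[i][j] >= ergodic_cutoff]
        for i in range(n)
    ]
    reach = [reachable(adjacency, i) for i in range(n)]
    best = set()
    for i in range(n):
        # j belongs to i's component if each can reach the other.
        component = {j for j in reach[i] if i in reach[j]}
        if len(component) > len(best):
            best = component
    return sorted(best)


def build_mapping(kept_states):
    """Map each retained input state to a contiguous internal index."""
    return {state: index for index, state in enumerate(kept_states)}


# State 0 can be left but never re-entered, so it is trimmed.
counts = [[2, 1, 0],
          [0, 5, 2],
          [0, 3, 4]]
kept = largest_ergodic_component(counts, ergodic_cutoff=1)
mapping = build_mapping(kept)
```

With ergodic_cutoff=0 every edge satisfies the threshold, so all states are retained, which is the sense in which the trimming is turned off.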
n_states_

int – The number of states in the model

mapping_

dict – Mapping between “input” labels and internal state indices used by the counts and transition matrix for this Markov state model. Input states need not necessarily be integers in (0, ..., n_states_ - 1), for example. The semantics of mapping_[i] = j is that state i from the “input space” is represented by the index j in this MSM.

countsmat_

array_like, shape = (n_states_, n_states_) – Number of transition counts between states. countsmat_[i, j] is counted during fit(). The indices i and j are the “internal” indices described above. No correction for reversibility is made to this matrix.
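
As a rough illustration of how the sliding_window and prior_counts parameters affect the counts, the sketch below counts lagged transitions in a single sequence of internal indices and then row-normalizes them. The function names are hypothetical, the row normalization is a plain non-reversible estimate rather than the sampler this class uses, and any additional scaling the library may apply to sliding-window counts is not reproduced.

```python
def count_transitions(sequence, n_states, lag_time=1, sliding_window=True):
    """Count transitions i -> j separated by lag_time frames."""
    if sliding_window:
        # Every frame starts a transition; pairs are lag_time frames apart.
        frames, step = list(sequence), lag_time
    else:
        # Subsample first, then count transitions between neighbors.
        frames, step = list(sequence)[::lag_time], 1
    counts = [[0] * n_states for _ in range(n_states)]
    for i, j in zip(frames[:-step], frames[step:]):
        counts[i][j] += 1
    return counts


def row_normalize(counts, prior_counts=0.0):
    """Add pseudo counts to every entry, then normalize each row to sum to 1."""
    transmat = []
    for row in counts:
        padded = [c + prior_counts for c in row]
        total = sum(padded)
        if total == 0:
            raise ValueError("row has no counts; cannot normalize")
        transmat.append([c / total for c in padded])
    return transmat


sequence = [0, 0, 1, 1, 0, 1]
sliding = count_transitions(sequence, n_states=2, lag_time=2)
subsampled = count_transitions(sequence, n_states=2, lag_time=2,
                               sliding_window=False)
no_prior = row_normalize(sliding, prior_counts=0.0)
with_prior = row_normalize(sliding, prior_counts=1.0)
```

Here the sliding window yields four transitions and subsampling yields two, and the unobserved 0 -> 0 transition has zero probability only when prior_counts is 0.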

all_transmats_

array_like, shape = (n_samples, n_states_, n_states_) – Samples from the posterior ensemble of transition matrices.
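
One way to use the posterior samples is to summarize a quantity across the ensemble. The sketch below computes the mean and sample standard deviation of a single transition probability; the three matrices are made-up values standing in for samples, and posterior_summary is a hypothetical helper, not part of the class.

```python
import statistics


def posterior_summary(transmats, i, j):
    """Mean and sample standard deviation of entry (i, j) across samples."""
    values = [transmat[i][j] for transmat in transmats]
    return statistics.mean(values), statistics.stdev(values)


# Made-up stand-ins for three sampled 2x2 transition matrices.
samples = [
    [[0.90, 0.10], [0.20, 0.80]],
    [[0.88, 0.12], [0.25, 0.75]],
    [[0.92, 0.08], [0.15, 0.85]],
]
mean_01, std_01 = posterior_summary(samples, 0, 1)
```

The same pattern applies to any function of the transition matrix: evaluate it on each sample, then summarize the resulting values.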

Notes

Markov chain Monte Carlo can be computationally expensive. To get converged results with acceptable performance, you will likely need to tune the n_samples, n_steps and n_chains parameters. n_samples gives the total number of transition matrices sampled from the posterior. These samples are generated from n_chains independent MCMC chains, at an interval of n_steps. The total number of iterations of MCMC performed during fit() is n_samples * n_steps. Increasing n_chains therefore does not alter the total number of iterations; instead, it controls whether those iterations occur as part of one long chain or multiple shorter chains (which are run in parallel for sampler=='metzner').
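
The arithmetic in this note can be written out as a short sketch. mcmc_budget is a hypothetical helper, and the even split of samples across chains is an assumption; how the library handles a remainder is not stated here.

```python
def mcmc_budget(n_samples, n_steps, n_chains):
    """Return (total iterations, samples per chain, iterations per chain)."""
    if n_samples % n_chains != 0:
        # Assumption for this sketch only: samples divide evenly.
        raise ValueError("n_samples must be divisible by n_chains")
    total_iterations = n_samples * n_steps
    samples_per_chain = n_samples // n_chains
    iterations_per_chain = samples_per_chain * n_steps
    return total_iterations, samples_per_chain, iterations_per_chain


one_chain = mcmc_budget(n_samples=100, n_steps=50, n_chains=1)
four_chains = mcmc_budget(n_samples=100, n_steps=50, n_chains=4)
```

Both calls give the same total of 5000 iterations; with four chains each chain contributes 25 samples from 1250 iterations.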

References

[1] P. Metzner, F. Noé and C. Schütte, “Estimating the sampling error: Distribution of transition matrices and functions of transition matrices for given trajectory data.” Phys. Rev. E 80, 021106 (2009).
__init__(lag_time=1, n_samples=100, n_steps=0, n_chains=None, n_timescales=None, reversible=True, ergodic_cutoff='on', prior_counts=0, sliding_window=True, random_state=None, sampler='metzner', verbose=False)

Methods

__init__([lag_time, n_samples, n_steps, ...])
fit(sequences[, y])
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
inverse_transform(sequences) Transform a list of sequences from internal indexing into labels
partial_transform(sequence[, mode]) Transform a sequence to internal indexing
set_params(\*\*params) Set the parameters of this estimator.
summarize()
transform(sequences[, mode]) Transform a list of sequences to internal indexing

Attributes

all_eigenvalues_ Eigenvalues of the transition matrices.
all_left_eigenvectors_ Left eigenvectors, \(\Phi\), of each transition matrix in the ensemble
all_populations_
all_right_eigenvectors_ Right eigenvectors, \(\Psi\), of each transition matrix in the ensemble
all_timescales_ Implied relaxation timescales of each sample in the ensemble
all_eigenvalues_

Eigenvalues of the transition matrices.

Returns:eigs – The eigenvalues of each transition matrix in the ensemble
Return type:array-like, shape = (n_samples, n_timescales+1)
all_left_eigenvectors_

Left eigenvectors, \(\Phi\), of each transition matrix in the ensemble

Each transition matrix’s left eigenvectors are normalized such that:

  • lv[:, 0] contains the equilibrium populations and is normalized such that sum(lv[:, 0]) == 1
  • The eigenvectors satisfy sum(lv[:, i] * lv[:, i] / model.populations_) == 1. In math notation, this is \(\langle \phi_i, \phi_i \rangle_{\mu^{-1}} = 1\)
Returns:lv – The columns of lv, lv[:, i], are the left eigenvectors of transmat_.
Return type:array-like, shape=(n_samples, n_states, n_timescales+1)
all_right_eigenvectors_

Right eigenvectors, \(\Psi\), of each transition matrix in the ensemble

Each transition matrix’s right eigenvectors are normalized such that:

  • Weighted by the stationary distribution, the right eigenvectors are normalized to 1. That is,

    sum(rv[:, i] * rv[:, i] * self.populations_) == 1,

    or \(\langle \psi_i, \psi_i \rangle_{\mu} = 1\)

Returns:rv – The columns of rv, rv[:, i], are the right eigenvectors of transmat_.
Return type:array-like, shape=(n_samples, n_states, n_timescales+1)
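
The two normalization conventions can be checked by hand on a small case. The sketch below uses a made-up two-state transition matrix, [[0.9, 0.1], [0.2, 0.8]], whose stationary distribution is (2/3, 1/3) and whose second eigenvalue is 0.7; the helper names are hypothetical and the unnormalized eigenvectors are entered directly rather than computed.

```python
import math


def normalize_left(phi, populations):
    """Scale phi so that sum(phi * phi / populations) == 1."""
    norm = math.sqrt(sum(p * p / mu for p, mu in zip(phi, populations)))
    return [p / norm for p in phi]


def normalize_right(psi, populations):
    """Scale psi so that sum(psi * psi * populations) == 1."""
    norm = math.sqrt(sum(p * p * mu for p, mu in zip(psi, populations)))
    return [p / norm for p in psi]


populations = [2 / 3, 1 / 3]
# Unnormalized eigenvectors for eigenvalue 0.7 of the matrix above.
phi_1 = normalize_left([1.0, -1.0], populations)
psi_1 = normalize_right([1.0, -2.0], populations)
# The stationary vector already satisfies the left normalization,
# because sum(mu * mu / mu) == sum(mu) == 1.
phi_0 = normalize_left(populations, populations)
```

In this example the normalized vectors also satisfy phi_1[k] == populations[k] * psi_1[k] up to rounding, which is the usual relation between left and right eigenvectors of a reversible transition matrix.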
all_timescales_

Implied relaxation timescales of each sample in the ensemble

Returns:timescales – The longest implied relaxation timescales of each sample in the ensemble of transition matrices, expressed in units of the time step between indices in the source data supplied to fit().
Return type:array-like, shape = (n_samples, n_timescales)

References

[1] Prinz, Jan-Hendrik, et al. “Markov models of molecular kinetics: Generation and validation.” J. Chem. Phys. 134.17 (2011): 174105.
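
For orientation, implied timescales are conventionally obtained from the eigenvalues of a transition matrix through t_i = -lag_time / ln(lambda_i), skipping the stationary eigenvalue of 1. The sketch below applies that relation to one sample; implied_timescales is a hypothetical helper, and it is an assumption that this matches the library's computation in every detail.

```python
import math


def implied_timescales(eigenvalues, lag_time=1):
    """Apply t_i = -lag_time / ln(lambda_i) to the non-stationary eigenvalues.

    Assumes eigenvalues[0] == 1 and the remaining values lie in (0, 1).
    """
    return [-lag_time / math.log(lam) for lam in eigenvalues[1:]]


# Made-up eigenvalues for a single sampled transition matrix.
timescales = implied_timescales([1.0, math.exp(-0.5), 0.7], lag_time=1)
scaled = implied_timescales([1.0, 0.7], lag_time=10)
```

Because the result scales linearly with lag_time, the timescales come out in the same units as the spacing between frames in the input sequences.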

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (numpy array of shape [n_samples, n_features]) – Training set.
  • y (numpy array of shape [n_samples]) – Target values.
Returns:

X_new – Transformed array.

Return type:

numpy array of shape [n_samples, n_features_new]

get_params(deep=True)

Get parameters for this estimator.

Parameters:deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:params – Parameter names mapped to their values.
Return type:mapping of string to any
inverse_transform(sequences)

Transform a list of sequences from internal indexing into labels

Parameters:sequences (list) – List of sequences, each of which is a one-dimensional array of integers in 0, ..., n_states_ - 1.
Returns:sequences – List of sequences, each of which is a one-dimensional array of labels.
Return type:list
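
Conceptually, this is the inverse of the label-to-index dictionary described under mapping_. A minimal sketch, assuming a plain dict and using a hypothetical function name:

```python
def inverse_transform_sketch(sequences, mapping):
    """Map sequences of internal indices back to their input labels."""
    # Invert the label -> index dictionary into index -> label.
    inverse = {index: label for label, index in mapping.items()}
    return [[inverse[index] for index in sequence] for sequence in sequences]


mapping = {10: 0, 20: 1, 30: 2}
labels = inverse_transform_sketch([[0, 1], [2, 2, 0]], mapping)
```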
partial_transform(sequence, mode='clip')

Transform a sequence to internal indexing

Recall that sequence can contain arbitrary labels, whereas transmat_ and countsmat_ are indexed with integers between 0 and n_states_ - 1. This method maps a sequence from the labels onto this internal indexing.

Parameters:
  • sequence (array-like) – A 1D iterable of state labels. Labels can be integers, strings, or other orderable objects.
  • mode ({'clip', 'fill'}) – Method by which to treat labels in sequence which do not have a corresponding index. This can be due, for example, to the ergodic trimming step.
    clip – Unmapped labels are removed during transform. If they occur at the beginning or end of a sequence, the resulting transformed sequence will be shortened. If they occur in the middle of a sequence, that sequence will be broken into two (or more) sequences. (Default)
    fill – Unmapped labels will be replaced with NaN, to signal missing data. [The use of NaN to signal missing data is not fantastic, but it’s consistent with current behavior of the pandas library.]
Returns:

mapped_sequence – If mode is “fill”, return an ndarray in internal indexing. If mode is “clip”, return a list of ndarrays each in internal indexing.

Return type:

list or ndarray
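
The difference between the two modes is easiest to see on a short sequence. The following pure-Python sketch mimics the described behavior with lists instead of ndarrays; partial_transform_sketch is a hypothetical stand-in, not the method itself.

```python
import math


def partial_transform_sketch(sequence, mapping, mode="clip"):
    """Map labels to internal indices, handling unmapped labels by `mode`."""
    if mode == "fill":
        # Keep the original length; unmapped labels become NaN.
        return [mapping.get(label, math.nan) for label in sequence]
    if mode != "clip":
        raise ValueError("mode must be 'clip' or 'fill'")
    pieces, current = [], []
    for label in sequence:
        if label in mapping:
            current.append(mapping[label])
        elif current:
            # An unmapped label ends the current piece.
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)
    return pieces


mapping = {"a": 0, "b": 1}
sequence = ["x", "a", "b", "x", "b", "a", "x"]
clipped = partial_transform_sketch(sequence, mapping, mode="clip")
filled = partial_transform_sketch(sequence, mapping, mode="fill")
```

In clip mode the unmapped label "x" is dropped at both ends and splits the sequence in the middle; in fill mode the output keeps its original length with NaN placeholders.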

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:
Return type:self
transform(sequences, mode='clip')

Transform a list of sequences to internal indexing

Recall that sequences can contain arbitrary labels, whereas transmat_ and countsmat_ are indexed with integers between 0 and n_states_ - 1. This method maps a set of sequences from the labels onto this internal indexing.

Parameters:
  • sequences (list of array-like) – List of sequences, or a single sequence. Each sequence should be a 1D iterable of state labels. Labels can be integers, strings, or other orderable objects.
  • mode ({'clip', 'fill'}) – Method by which to treat labels in sequences which do not have a corresponding index. This can be due, for example, to the ergodic trimming step.
    clip – Unmapped labels are removed during transform. If they occur at the beginning or end of a sequence, the resulting transformed sequence will be shortened. If they occur in the middle of a sequence, that sequence will be broken into two (or more) sequences. (Default)
    fill – Unmapped labels will be replaced with NaN, to signal missing data. [The use of NaN to signal missing data is not fantastic, but it’s consistent with current behavior of the pandas library.]
Returns:

mapped_sequences – List of sequences in internal indexing

Return type:

list