# Data-driven stochastic modeling: Markov chain Matlab codes

Dr. Jesse Dorrestijn
28 Dec 2019

This page has been created in support of my PhD thesis, Stochastic Convection Parameterization, which I successfully defended at Delft University of Technology (Netherlands) in 2016. The aim of this page is to share the Matlab Markov chain codes that I used during my studies of Markov chain modeling of the atmosphere. Markov chains can be used to simulate a memoryless process, i.e. one whose future states depend only on the present state and not on the past; such a process satisfies the Markov property. Data-driven Markov chains can be estimated when time series of observations of such a process are available. The main strategy to construct a finite-state Markov chain is as follows:

• Given a sequence of observations of a process, clustering techniques such as k-means can bring the number of states back to a finite number of states;
• Data-inferred probabilities of transitions between those states can be estimated by counting and form a transition probability matrix;
• Correlation analysis can be used to determine which external factors affect those probabilities;
• By conditioning on those external known factors, transition probabilities can be estimated again while conditioning on those factors/variables. This results in a data-driven conditional Markov chain.

As a further reference, one could consult the paper Stochastic Convection Parameterization with Markov Chains in an Intermediate Complexity GCM, in which data-driven Markov chains are used to enhance the representation of atmospheric clouds in climate models. Many processes in the atmosphere approximately satisfy the Markov property, with future states determined largely by the present state rather than by the more distant past, which explains the application of Markov chains in this field.

Contents of this web page:
Example 1: Two-state Markov chains
Example 2: Markov chains with N states
Example 3: Markov chains conditioned on an external variable
Example 4: Markov chains conditioned on an external variable on two time instances
Example 5: Clustering of observations
Example 6: Simultaneous clustering of 2 observations
Example 7: k-means Matlab code example: 2D clustering
Example 8: Cross-correlation analysis in Matlab

A finite-state Markov chain has a finite number of states and switches between these states with certain probabilities. These probabilities can be collected in a matrix P. If the Markov chain has 2 states, the state transition matrix is of size 2 x 2. On the diagonal are the probabilities that the state does not change in one time step from t to t+1; the other probabilities are off the diagonal. For example, the probability of switching from state 1 to state 2 is entry (1,2) of the matrix. Together with an initial state, the Markov chain can produce sequences.
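The codes on this page are in Matlab, but the simulation step just described is easy to sketch in any language. Here is a minimal Python/NumPy version; the 2 x 2 matrix P below is made up for the example:

```python
import numpy as np

# Hypothetical 2-state transition matrix: entry (i, j) is the
# probability of moving from state i to state j in one time step.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])

rng = np.random.default_rng(0)

def simulate(P, x0, T, rng):
    """Simulate T steps of a finite-state Markov chain from initial state x0."""
    states = np.empty(T, dtype=int)
    states[0] = x0
    for t in range(1, T):
        # Draw the next state from the row of P belonging to the current state.
        states[t] = rng.choice(len(P), p=P[states[t - 1]])
    return states

y = simulate(P, x0=0, T=1000, rng=rng)
```

Each row of P must sum to 1, since from every state the chain has to go somewhere.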

A Markov chain can be used to mimic a certain process. If a process has for example only two states, and a long sequence is available, transition probabilities of the Markov chain can be estimated from this sequence.

Example 1: a two-state Markov chain
In the following example, we first construct a sequence with only two states. This sequence, which we call the observations y_obs, will be used to estimate the transition probabilities of a Markov chain, which are collected in a matrix P_MC. Finally, we use the Markov chain to construct a sequence y_MC that is similar to the original sequence y_obs, and we plot the two sequences. The length L of the original sequence can be adjusted, after which the probability matrix P_MC can be compared to P_obs. The transition probabilities themselves can also be adjusted.
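The estimation-by-counting step can be sketched as follows (a Python/NumPy stand-in for the linked Matlab code; the probabilities in P_obs are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)

# "True" transition probabilities used to generate the observations
# (placeholder values standing in for the y_obs of the Matlab example).
P_obs = np.array([[0.95, 0.05],
                  [0.10, 0.90]])

L = 10_000                      # length of the observed sequence
y_obs = np.empty(L, dtype=int)
y_obs[0] = 0
for t in range(1, L):
    y_obs[t] = rng.choice(2, p=P_obs[y_obs[t - 1]])

# Estimate transition probabilities by counting observed transitions
# and normalizing each row of the count matrix.
counts = np.zeros((2, 2))
for a, b in zip(y_obs[:-1], y_obs[1:]):
    counts[a, b] += 1
P_MC = counts / counts.sum(axis=1, keepdims=True)
```

For a long enough sequence, P_MC comes close to P_obs; shortening L shows how the estimate degrades.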

-->A link to Markov Chain Matlab Code Example 1.<-- Caption: [top] a sequence of observations of a process having two states [bottom] a realization of a Markov Chain trained on these observations.

Example 2: Markov chains with N small-scale states
Let us do almost the same as in Example 1, but now consider Markov chains with more than 2 states: let N > 2 be the number of states. You can adjust this number.
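The counting procedure generalizes directly to N states; a hedged Python/NumPy sketch (the sequence below is random stand-in data, not real observations):

```python
import numpy as np

def estimate_transition_matrix(seq, N):
    """Estimate an N x N transition matrix from a state sequence by counting."""
    counts = np.zeros((N, N))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Rows never visited get a uniform distribution so P stays stochastic.
    P = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / N)
    return P

rng = np.random.default_rng(2)
N = 4
# A made-up N-state sequence standing in for y_obs.
seq = rng.integers(0, N, size=5000)
P = estimate_transition_matrix(seq, N)
```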

-->A link to Matlab Markov Chain Example 2.<-- Caption: [top] a sequence of observations of a process having multiple states [bottom] a realization of a Markov Chain trained on these observations.

Example 3: Markov chains conditioned on a large-scale variable
Let us now introduce conditioning. Imagine that the transition probabilities depend on a certain variable X. The Markov chain then becomes a Conditional Markov Chain, because it is conditioned on X. The observational data now consist of a sequence y_obs (as in Examples 1 and 2) and an additional sequence X_obs. To run the Conditional Markov Chain after it has been constructed, a sequence X must be available, e.g. X = X_obs. In the context of climate or weather models, the large-scale variable X can for example be the average surface temperature in an area: a variable that is known to the model. The small-scale variable y, for example the convective area fraction, is not known to the model and can be represented by a Markov chain.
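Conditioning amounts to estimating one transition matrix per state of X, again by counting. A Python/NumPy sketch with made-up conditional matrices (the Matlab code linked below is the actual example):

```python
import numpy as np

rng = np.random.default_rng(3)

Ny, Nx, L = 2, 2, 20_000
# Made-up conditional transition matrices: P_cond_true[m] applies when X(t) = m.
P_cond_true = np.array([[[0.9, 0.1], [0.3, 0.7]],    # X = 0
                        [[0.4, 0.6], [0.1, 0.9]]])   # X = 1

X_obs = rng.integers(0, Nx, size=L)                  # stand-in large-scale sequence
y_obs = np.empty(L, dtype=int)
y_obs[0] = 0
for t in range(1, L):
    # The transition from t-1 to t uses the matrix selected by X at time t-1.
    y_obs[t] = rng.choice(Ny, p=P_cond_true[X_obs[t - 1], y_obs[t - 1]])

# Estimate one transition matrix per large-scale state by conditional counting.
counts = np.zeros((Nx, Ny, Ny))
for t in range(L - 1):
    counts[X_obs[t], y_obs[t], y_obs[t + 1]] += 1
P_cond = counts / counts.sum(axis=2, keepdims=True)
```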

-->A link to Markov Chain Matlab Code Example 3.<-- Caption: [top] Sequence of observations of a process having multiple states (blue line) and the states of a large-scale variable/background variable (red dashed line) [center] a realization of a Markov Chain trained on these observations [bottom] a realization of a Conditional Markov Chain trained on these observations.

Example 4: Markov chains conditioned on a large-scale variable on two time instances
Let us now additionally condition on X(t+1). In the previous example, the transition probabilities were conditioned only on X(t). The probability that the Conditional Markov chain switches from state k to state l now becomes P(Y_CMC(t+1) = l | Y_CMC(t) = k, X(t) = m, X(t+1) = n). For each combination of X(t) and X(t+1) there is a state transition matrix for Y_CMC.
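The bookkeeping for this pair-conditioning can be sketched as follows; the sequences here are purely random stand-ins, used only to exercise the counting and normalization (the linked Matlab code works on the actual example data):

```python
import numpy as np

rng = np.random.default_rng(4)
Ny, Nx, L = 2, 2, 20_000

# Stand-in observed sequences (placeholders for y_obs and X_obs).
X_obs = rng.integers(0, Nx, size=L)
y_obs = rng.integers(0, Ny, size=L)

# Count transitions of y conditioned on the pair (X(t), X(t+1)).
counts = np.zeros((Nx, Nx, Ny, Ny))
for t in range(L - 1):
    counts[X_obs[t], X_obs[t + 1], y_obs[t], y_obs[t + 1]] += 1

# Normalize over the arrival state l to obtain
# P(Y(t+1) = l | Y(t) = k, X(t) = m, X(t+1) = n).
P_CMC = counts / counts.sum(axis=3, keepdims=True)
```

Note that the number of matrices to estimate grows with Nx squared, so this variant needs more data than Example 3.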

-->A link to Markov Chain Matlab Code Example 4.<-- Caption: [top] Sequence of observations of a process having multiple states (blue line) and the states of a large-scale variable/background variable (red dashed line) [center] a realization of a Markov Chain trained on these observations [bottom] a realization of a Conditional Markov Chain trained on these observations, conditioned on state transitions of the large-scale variable.

Example 5: Clustering of observations
To use a finite-state Markov chain, the observational sequence needs to be classified into a finite number of states. The variable X likewise needs to be classified into a finite number of states. We use k-means clustering for both.

-->A link to Markov Chain Matlab Code Example 5.<--   Caption: [top] Sequence of observations of a process having multiple states (blue line) and observations of a large-scale variable/background variable (red line) [center] discretized version of the observations using k-means for both the small-scale and large-scale sequences [bottom] a realization of a Conditional Markov Chain trained on these observations.

Example 6: Simultaneous clustering of 2 observations
Same as the previous example, but with two variables to condition on: X and Z. We use k-means for 2D clustering of the pairs (X, Z). The large-scale state is determined by the values of both X and Z.
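The 2D k-means step itself is sketched under Example 7; here is only the bookkeeping that turns a pair of already-discretized large-scale variables into a single conditioning state (a simple uniform-grid alternative that the clustered version generalizes; the sequences are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(6)
Nx, Nz, L = 3, 2, 1000

# Stand-ins for already-discretized large-scale sequences X and Z.
X = rng.integers(0, Nx, size=L)
Z = rng.integers(0, Nz, size=L)

# Encode each pair (X, Z) as one joint conditioning state in 0 .. Nx*Nz - 1,
# so the conditional counting of Example 3 can be reused unchanged.
joint = X * Nz + Z
```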

-->A link to Matlab Markov Chain Example 6.<-- Caption: [top] Sequence of observations of a process (blue line) and observations of two large-scale/background variables (red and green lines) [bottom] a realization of a Conditional Markov Chain trained on these observations, with simultaneously clustered large-scale variables. Note that the outliers indicate that there is room for improvement of this particular Markov chain (i.e. more data is needed).

Example 7: Extra simple example of 2D clustering
Using k-means for 2D clustering of two normally distributed variables.
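In the same spirit as the 1-D sketch of Example 5, a plain Lloyd's-algorithm k-means in two dimensions (Python/NumPy stand-in for the linked Matlab code, on made-up Gaussian data):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; points has shape (n, d)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids at k distinct random data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Squared distances of every point to every centroid: shape (n, k).
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(7)
# Two independent normally distributed variables, as in the example figure.
pts = rng.normal(size=(500, 2))
labels, centroids = kmeans(pts, k=10)
```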

-->A link to Matlab: K-means Clustering Code Example 7.<-- Caption: K-means clustering of data stemming from two independent normally distributed variables (colored pixels) and the ten cluster centroids (red stars).

Example 8: Cross-correlation analysis
Using cross-correlation analysis to determine which variable can best be used to condition on. In the example figure, X_ori displays a much stronger correlation than Z_ori; therefore, X_ori can be chosen as the conditioning variable instead of Z_ori.
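The idea can be illustrated with lagged Pearson correlations on synthetic data (the variable names X and Z here are illustrative, not the X_ori/Z_ori of the figure): y tracks X without delay, while Z runs five steps ahead of y, so the cross-correlation with Z peaks at a positive lag.

```python
import numpy as np

def lagged_corr(y, x, max_lag):
    """Pearson correlation between y(t) and x(t - lag) for each lag."""
    lags = range(-max_lag, max_lag + 1)
    out = []
    for lag in lags:
        if lag >= 0:
            a, b = y[lag:], x[:len(x) - lag]
        else:
            a, b = y[:len(y) + lag], x[-lag:]
        out.append(np.corrcoef(a, b)[0, 1])
    return np.array(lags), np.array(out)

rng = np.random.default_rng(8)
L = 5000
X = rng.normal(size=L)
y = X + 0.1 * rng.normal(size=L)   # y follows X with no lag
Z = np.roll(X, -5)                 # Z leads y by 5 time steps

lags, r_X = lagged_corr(y, X, max_lag=10)
_, r_Z = lagged_corr(y, Z, max_lag=10)
```

The lag at which the cross-correlation peaks tells you whether to condition on present or past values of a candidate variable.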

-->A link to Code Example 8: Matlab Cross-Correlation Analysis.<-- Caption: [top] time series of a process y and two large-scale variables X and Z [bottom] the cross-correlation between y and X (red line) and between y and Z (green line) as a function of time lag. The variable X does not lag y and can be used directly as a variable to condition on (better than Z). Observe that y lags Z: if one wants to condition on Z, it is better to use values of Z from the past than from the present. Still, X is the better variable to condition on.

Usage of the codes on this page is free. If you use them for your scientific work, please refer to my papers. Thanks, Jesse