Approximating Markov Chains for Bootstrapping and Simulation

In this work we develop a bootstrap method based on the theory of Markov chains. The method stems from the two competing objectives that a researcher pursues when performing a bootstrap procedure: (i) to preserve the structural similarity, in a statistical sense, between the original and the bootstrapped sample; (ii) to assure a diversification of the latter with respect to the former. The original sample is assumed to be driven by a Markov chain. The approach we follow is to implement an optimization problem to estimate the memory of a Markov chain (i.e. its order) and to identify its relevant states. The basic ingredients of the model are the transition probabilities, whose distance is measured through a suitably defined functional. We apply the method to the series of electricity prices in Spain. A comparison with the Variable Length Markov Chain bootstrap, a well-established bootstrap method, shows the superiority of our proposal in reproducing the dependence among data.


Introduction
The heart of the bootstrap, introduced by Efron (1979), consists of resampling some given observations with the purpose of obtaining a good estimate of the statistical properties of the original population. Among the different bootstrap methods, a prominent role is played by those based on Markov chains (see e.g. Athreya and Fuh, 1992; Bühlmann, 1997; Horowitz, 2003; Paparoditis and Politis, 2001a, 2001b; Anatolyev and Vasnev, 2002; Bertail and Clémençon, 2007). The major advantage of this approach is that it is entirely data driven, so that it can smoothly capture the dependence structure of an observed time series, releasing the researcher from the risk of wrongly specifying the model and from the difficulties of estimating its parameters.
In this paper we develop an original general approach to determine the relevant states and the memory (i.e. the order) of a Markov chain. The bootstrap procedure advanced here works similarly to that of Anatolyev and Vasnev (2002), who propose a Markov chain bootstrap where states correspond to the intervals resulting from a partition of the state space (of an observed time series) into I quantiles. However, differently from that work, our proposal places much greater care in identifying the states of the Markov chain. In particular, the approach we propose is based on the joint estimation of the relevant states and of the order of a Markov chain through an optimization problem. The solution identifies the partition which groups the states with the most similar transition probabilities. In this way the resulting groups emerge as the relevant states, that is, the states which significantly influence the conditional distribution of the process. In this work we also extend theoretically the analysis in Cerqueti et al. (2010, 2012, 2013) by introducing L^p norm based distance measures. We also show that the minimization of the objective function represented by the distance measure of the partitions, which is based on the transition probabilities of the states, corresponds to the minimization of the information loss function in the sense of Kolmogorov (1965). The optimization problem also includes a "multiplicity" constraint, which controls for a sufficient diversification of the resampled trajectories. Our proposal exploits the powerful conditioning tool provided by the transition probability matrix of Markov chains to model correctly and efficiently random processes with arbitrary dependence structure.
The results shown in the application to the electricity prices of the Spanish market confirm the better performances of the method proposed here with respect to a well established bootstrap approach, such as the Variable Length Markov Chain (VLMC) bootstrap of Bühlmann and Wyner (1999).
The paper is organized as follows. Section 2 introduces the settings of the model. Section 3 clarifies the theoretical foundation of the optimization problem we deal with. Section 4 formalizes the optimization problem. Section 5 provides a validation of the theoretical results through numerical experiments based on real data. Section 6 offers some conclusive remarks.

Model
We suppose that we observe N realizations, homogeneously spaced in time, of a phenomenon, and we introduce the set of such time-ordered observations as E = {y_1, . . . , y_N}. There exist J_N ≥ 1 distinct states a_1, . . . , a_{J_N} ∈ E. The corresponding subsets of E, denoted as E_1, . . . , E_{J_N} and defined as

E_z = {y_n ∈ E : y_n = a_z},   z = 1, . . . , J_N,

constitute a partition of E. Moreover, for fixed z = 1, . . . , J_N, the frequency of state a_z in the observed time series E is the cardinality of E_z. Let A = {a_1, . . . , a_{J_N}} be the range of the observed time series. We now consider a time-homogeneous Markov chain of order k ≥ 1, denoted as X = {X(t), t ≥ 0}, with state space A. To ease the notation, in the following we will simply write Markov chain instead of time-homogeneous Markov chain. The k-lag memory of the Markov chain implies that the transition probability matrix should account for conditioning on trajectories of length k. Therefore, we refer hereafter to a k-path transition probability matrix.
Let us consider a_z ∈ A and a_h = (a_{h,k}, . . . , a_{h,1}) ∈ A^k. The row vector a_h is the ordered set of k states a_{h,w} ∈ A, w = 1, . . . , k, listed, in a natural way, from the furthest to the closest realization of the chain. The row vector a_h will be called a k-path. This ordering of the realizations will be maintained throughout the paper. The Markov chain has transition probability from k-path a_h to state a_z given by

P(a_z | a_h) = P(X(t) = a_z | X(t − w) = a_{h,w}, w = 1, . . . , k).   (1)

According to Ching et al. (2008), we estimate P(a_z | a_h) with the empirical frequencies f(a_z | a_h) related to the N realizations of the phenomenon. For the sake of simplicity, we avoid introducing throughout the paper a specific notation for the estimates of the probabilities; therefore we estimate P(a_z | a_h) by

P(a_z | a_h) = f(a_z | a_h) = n(a_h, a_z) / n(a_h),

where n(a_h) is the number of occurrences of the k-path a_h in E and n(a_h, a_z) is the number of occurrences of a_h immediately followed by a_z. Let us now introduce the set Λ of the partitions of A. A generic element λ ∈ Λ can be written as λ = {A_1, . . . , A_{|λ|}}, where |λ| is the cardinality of λ, with 1 ≤ |λ| ≤ J_N, and {A_q}_{q=1,...,|λ|} is a partition of A into nonempty subsets. The cardinality of Λ is B(J_N), i.e. the Bell number of the J_N elements in set A.
Extending our notation to a multidimensional context, we consider the set Λ^k of k-dimensional partitions. The set Λ^k contains the partitions we will focus on in the present paper. A k-dimensional partition of Λ^k is denoted as λ and is defined as

λ = {A_q : A_q = A_{q_k,k} × · · · × A_{q_1,1}, with q_w = 1, . . . , |λ_w| and w = 1, . . . , k},

where A_{q_w,w} is a class of the partition λ_w and λ_w is a partition of A at time lag w. A k-dimensional partition of Λ^k can also be (more easily) represented by the k-tuple of the partitions λ_w, w = 1, . . . , k, which the classes A_{q_w,w} belong to. So the partition λ can also be identified with the notation λ = (λ_k, . . . , λ_w, . . . , λ_1). Such notation describes the fact that λ is a time-dependent partition of A, i.e. A is partitioned in different ways at each time lag w, w = 1, . . . , k. The cardinality of Λ^k is [B(J_N)]^k; the cardinality of the partition λ is |λ| = ∏_{w=1}^{k} |λ_w|. We refer to the probability law P introduced in (1) and define

P(a_z | A_q) = P(X(t) = a_z | X(t − w) ∈ A_{q_w,w}, w = 1, . . . , k),   (2)

where A_q = A_{q_k,k} × · · · × A_{q_1,1}. The quantity in (2) is the transition probability to reach state a_z at time t after the process has been in the classes A_{q_k,k}, . . . , A_{q_1,1} in the previous k times. The transition probabilities P(a_z | A_q) in (2) are estimated, as usual, through the empirical frequencies, i.e. the number of observed transitions from k-paths belonging to A_q to a_z, divided by the number of occurrences of k-paths belonging to A_q. The quantities P(a_z | A_q) form a new transition probability matrix. To keep the notation as simple as possible, we continue to refer to this matrix as the k-path transition probability matrix. We deal in our paper with two questions related to finding the Markov chain which best describes the observed time series E:
• Which is the optimal k?
• Which is the optimal clustering of A for each time lag w, with w = 1, . . . , k?
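As a concrete illustration, the empirical-frequency estimation of the k-path transition probabilities can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code; the function name `kpath_transitions` is ours.

```python
from collections import Counter, defaultdict

def kpath_transitions(series, k):
    """Estimate the k-path transition probabilities P(a_z | a_h) by
    empirical frequencies: count each observed k-path and the states
    that immediately follow it, then normalize."""
    path_counts = Counter()               # n(a_h): occurrences of each k-path
    next_counts = defaultdict(Counter)    # n(a_h, a_z): a_h followed by a_z
    for t in range(k, len(series)):
        path = tuple(series[t - k:t])     # ordered from furthest to closest
        path_counts[path] += 1
        next_counts[path][series[t]] += 1
    return {h: {z: c / path_counts[h] for z, c in nxt.items()}
            for h, nxt in next_counts.items()}
```

For instance, on the alternating series `[1, 2, 1, 2, 1, 2]` with `k = 1`, the estimator returns probability 1 for the transitions 1 → 2 and 2 → 1.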

Theoretical foundation of the optimization problem
In the context of bootstrapping, the conflicting aims of a resampling procedure are two: on the one side, to maintain the statistical properties of the original sample (similarity); on the other side, to allow for a sufficient level of diversification between the original and the bootstrapped sample (multiplicity). In our model an optimal clustering procedure of the state space of a Markov chain, based on the fulfillment of the similarity and multiplicity requirements, is implemented to gain mathematical tractability when resampling.
The theoretical framework closest to our proposal is information theory, with specific reference to information loss. We can, in general, define a functional space G whose elements g act on the Markov chain X by defining a new Markov chain X̃. The states of X̃ are the elements of a partition of A^k. There is a clear bijection between the elements g of G and the partitions λ_g of A^k, and each partition is associated with a specific amount of information. Let us write X̃ = X | λ_g to formalize that the new Markov chain is X conditioned to the information generated by the partition λ_g.
Since merging two k-paths cancels part of the information on the transition probabilities that is available when keeping them distinct, we can say that the partitions λ_g imply, in general, an information loss. This argument is in line with Kolmogorov (1965).
In order to measure such an information loss, a nonnegative functional η_X with values in [0, η̄] can be introduced, such that η_X(X̃) represents a distance measure between X and X̃, which increases as the loss of information does. If η_X(X̃) = 0, no information is lost, while η_X(X̃) = η̄ means that X̃ is generated by a partition providing the maximum level of information loss (no information left in passing from X to X̃). The consistency requirements in the boundary situations of 0 and η̄ lead to specific situations in our setting. We list them below, along with a brief explanation.
1. η_X(X̃) = 0. This is the case of full information and occurs when X̃ = X. In this case the partition λ_g of the state space is the finest one: λ_g is the partition keeping all the states of the Markov chain separate (singleton partition).

2. η_X(X̃) = η̄. This occurs when the σ-algebra generated by X̃ is the trivial one, {∅, Ω}, where Ω represents the sample space of the probability space where the Markov chain is defined; then η_X(X̃) attains its maximum. In this case the maximum level of information is lost. In fact, the corresponding partition λ_g collects all the elements of the state space in a unique set (all-comprehensive partition).
In the following section we will introduce a distance indicator, d_λ, and a multiplicity measure, m_λ, which will be used to measure similarity and multiplicity, respectively. They are two specific (information loss) distance measures η, which indeed satisfy conditions 1 and 2. It can be expected that a partition of the states of a Markov chain minimizing, in a controlled way, an information loss distance measure will condition the evolution of the bootstrapped samples more consistently than a partition organized otherwise.

Optimization problem: A formalization
The concept of optimality is to be understood as satisfying the two requirements of a bootstrap procedure: statistical closeness between the original and the bootstrapped sample (minimization of a distance indicator) and a certain degree of diversification (a constraint on the level of a multiplicity measure). A constrained minimization problem can be defined along this line. We now enter into its details.

L p -type distance indicator
We define an L^p-type distance measure of the multidimensional class A_q as follows:

d_{A_q} = ( Σ_{a_h ∈ A_q} Σ_{z=1}^{J_N} | P(a_z | a_h) − P(a_z | A_q) |^p )^{1/p},   (3)

where P(a_z | A_q) is the class transition probability defined in (2). In this case, we preserve the similarity by imposing that the classes of a suitable partition have a low value of the indicator defined in (3). We can finally characterize the distance d_λ of partition λ as the cardinality-weighted average of the distances of its classes:

d_λ = (1/C) Σ_{q=1}^{|λ|} |A_q| d_{A_q},

where |A_q| is the cardinality of partition class A_q and C = Σ_{q=1}^{|λ|} |A_q|.
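The two quantities above can be sketched in Python. This is a minimal illustration under the assumed form of (3); the function names `class_distance` and `partition_distance` are ours, and `P_state`/`P_class` stand for the estimated k-path and class transition probabilities.

```python
def class_distance(P_state, P_class, states, A_q, p=1):
    """L^p-type distance of class A_q: aggregate deviation of each
    member k-path's transition probabilities from the pooled class
    probabilities (assumed form of eq. (3))."""
    total = 0.0
    for h in A_q:
        total += sum(abs(P_state[h].get(z, 0.0) - P_class.get(z, 0.0)) ** p
                     for z in states)
    return total ** (1.0 / p)

def partition_distance(d_classes, sizes):
    """d_lambda: cardinality-weighted average of the class distances."""
    C = sum(sizes)
    return sum(n * d for n, d in zip(sizes, d_classes)) / C
```

A class whose k-paths all share the pooled transition probabilities has distance zero, so merging statistically indistinguishable k-paths costs nothing.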

L r -type multiplicity measure
The multiplicity measures we propose are based on the sizes of the partition classes. Let us define the absolute multiplicity measure l_λ of the partition λ as the L^r norm of the vector of class cardinalities:

l_λ = ( Σ_{q=1}^{|λ|} |A_q|^r )^{1/r}.

We define the relative multiplicity measure m_λ, related to the partition λ, by normalizing l_λ as follows:

m_λ = (l_λ − C^{1/r}) / (C − C^{1/r}),

where C = Σ_{q=1}^{|λ|} |A_q|, so that m_λ = 0 for the singleton partition and m_λ = 1 for the all-comprehensive partition.
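A short sketch of these two measures, under the assumed normalization (m_λ equal to 0 for the singleton partition and 1 for the all-comprehensive one); the function names are ours.

```python
def absolute_multiplicity(sizes, r=2):
    """l_lambda: L^r norm of the vector of class cardinalities."""
    return sum(n ** r for n in sizes) ** (1.0 / r)

def relative_multiplicity(sizes, r=2):
    """m_lambda in [0, 1]: rescale l_lambda between its value on the
    singleton partition (C classes of size 1) and on the
    all-comprehensive partition (one class of size C)."""
    C = sum(sizes)
    l_min = C ** (1.0 / r)     # singleton partition
    l_max = C                  # all-comprehensive partition
    return (absolute_multiplicity(sizes, r) - l_min) / (l_max - l_min)
```

With r = 2 and four grouped k-paths, the singleton partition `[1, 1, 1, 1]` yields m_λ = 0 and the all-comprehensive partition `[4]` yields m_λ = 1.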

Optimization problem
We now present the optimization problem based on the similarity and multiplicity criteria developed so far.
Definition 1. Fix γ ∈ [0, 1], k* ∈ {1, . . . , N}, and λ* = (λ*_{k*}, . . . , λ*_1) ∈ Λ^{k*}. The couple (k*, λ*) is said to be d-γ-optimal when it is the solution of the following minimization problem:

min_{k, λ ∈ Λ^k} d_λ   subject to   m_λ ≥ γ.   (4)

In Definition 1 we have that k* is the optimal order of a Markov chain describing the evolutive phenomenon. Moreover, λ* provides the optimal time-dependent clustering of the state space A, in order to obtain an approximation of the k*-path transition probability matrix.
According to the definitions of d_λ and m_λ, we can briefly discuss the optimization problem. Setting γ = 0 allows for the singleton partition, which ensures the minimum distance (d_λ = 0). Setting γ = 1 corresponds to forcing the maximum level of multiplicity; this boundary, in our case, is satisfied only by the all-comprehensive partition, for which the distance indicator takes its maximum value.
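The structure of the constrained problem can be illustrated by a brute-force search over precomputed candidates. The paper solves the problem heuristically in practice; here `candidates`, a hypothetical interface of ours, maps each (k, partition) pair to its precomputed (d_λ, m_λ) values.

```python
def d_gamma_optimal(candidates, gamma):
    """Exhaustive search for the d-gamma-optimal pair (k, lambda):
    among partitions satisfying the multiplicity constraint
    m_lambda >= gamma, pick the one with minimum distance d_lambda."""
    feasible = {key: d for key, (d, m) in candidates.items() if m >= gamma}
    if not feasible:
        return None                    # no partition meets the constraint
    return min(feasible, key=feasible.get)
```

Raising γ shrinks the feasible set toward coarser partitions, trading similarity (low d_λ) for diversification, exactly the tension described above.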

Empirical validation of the model
This section compares the proposed method with another well-established bootstrap procedure, namely the Variable Length Markov Chain (VLMC) bootstrap (Bühlmann and Wyner, 1999). Only the case p = 1 and r = 2 is discussed; this single configuration suffices to support the strength of the method. Moreover, our method is bounded to a maximum order k = 7, while VLMC self-calibrates k. The original sample is the daily Mibel Spanish Electric System Arithmetic Average Price (euros per MWh) from January 2nd, 1998 to December 31st, 2003.
To assess the quality of the method, we analyze the statistical properties of the bootstrapped samples and compare them with those of the original sample. To this goal, we calculate the following statistics: average, standard deviation, skewness, kurtosis, minimum, maximum, and autocorrelations at lags 1 through 8.
The comparison focuses on the percentile rank that the original sample's statistics take with respect to the bootstrap distribution. The R package "VLMC" (available at http://cran.r-project.org/) was used to generate the bootstrapped samples for VLMC. As for our method, the optimization problem (4) was solved heuristically by means of a Tabu Search algorithm, in order to control its computational complexity (see Cerqueti et al., 2013). The number of bootstrapped samples is 5000.
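Once the transition structure is estimated, generating one bootstrapped trajectory amounts to repeatedly drawing a successor of the current k-path. A minimal sketch follows; for brevity it conditions on raw k-paths and draws uniformly among observed successors, whereas the paper conditions on the optimal grouped states. The function name `bootstrap_sample` is ours.

```python
import random

def bootstrap_sample(series, k, length, seed=None):
    """Resample a trajectory of the given length: start from the sample's
    first k-path, then repeatedly draw the next value uniformly among the
    observed successors of the current k-path."""
    rng = random.Random(seed)
    successors = {}
    for t in range(k, len(series)):
        successors.setdefault(tuple(series[t - k:t]), []).append(series[t])
    out = list(series[:k])
    while len(out) < length:
        out.append(rng.choice(successors[tuple(out[-k:])]))
    return out
```

Note that a k-path observed only at the very end of the series has no recorded successor; a production implementation would need to handle that boundary case (e.g. by restarting from a random k-path).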
The results of the comparison in Table 1 show that VLMC regularly generates narrower ranges between the 5th and the 95th percentiles than our method. However, these ranges only seldom contain the corresponding statistics of the original sample. In particular, the autocorrelations at all lags are severely under-replicated in the VLMC bootstrapped samples. These results confirm the expectation that the method proposed here, thanks to the minimum information loss pursued in our optimization problem, generates bootstrapped samples which reproduce more faithfully the dependence among the data of the original sample.

Conclusive remarks
This paper proposes an optimization problem with the goal of estimating the dimensions of the transition probability matrix of a Markov chain for simulation and bootstrap purposes. The optimization problem formalized here extends that presented in Cerqueti et al. (2010), in that it introduces distance measures of L^p (L^r) type. The satisfactory results obtained in the above-mentioned paper are thus further improved theoretically. The model is grounded in information theory, and it has been numerically validated through an experiment based on real data.