A mixed integer linear program to compress transition probability matrices in Markov chain bootstrapping

Bootstrapping time series is one of the most acknowledged tools to study the statistical properties of an evolutive phenomenon. An important class of bootstrapping methods is based on the assumption that the sampled phenomenon evolves according to a Markov chain. This assumption does not apply when the process takes values in a continuous set, as frequently happens with time series related to economic and financial phenomena. In this paper we apply Markov chain theory to bootstrap continuous-valued processes, starting from a suitable discretization of the support that provides the state space of a Markov chain of order k ≥ 1. Even for small k, the number of rows of the transition probability matrix is generally too large and, in many practical cases, it may incorporate much more information than is really required to replicate the phenomenon satisfactorily. The paper aims to study the problem of compressing the transition probability matrix while preserving the "law" characterising the process that generates the observed time series, in order to obtain bootstrapped series that maintain the typical features of the observed time series. For this purpose, we formulate a partitioning problem of the set of rows of such a matrix and propose a mixed integer linear program specifically tailored to it. We also provide an empirical analysis by applying our model to the time series of Spanish and German electricity prices, and we show that, in these medium-size real-life instances, the bootstrapped time series reproduce the typical features of the ones under observation.


Introduction
After the seminal paper by Efron (1979), several developments and applications of bootstrapping methods appeared in the literature. Methods following the original idea by Efron and based on re-sampling of model errors have been largely applied in Economics and Finance. The reader is referred to Freedman (1984), Freedman and Peters (1984), and Efron and Tibshirani (1993) for a methodological discussion, and to Brock et al. (1992) and Sullivan et al. (1999) for an application to financial markets. However, the original approach of Efron suffers, in general, from model misspecification risk and requires observations to be time-independent. To overcome such limitations, nonparametric, model-free bootstrapping methods have been suggested in the literature. In Bühlmann (2002) several bootstrapping methods, such as the block, the sieve, and the local methods, are compared. The advantage of nonparametric, model-free methods is that they do not require the observations to be time-independent: the data themselves are able to capture the dependence structure of the time series, thus relieving the researcher from the responsibility of choosing a model. Among the nonparametric bootstrapping methods, a relatively recent group is based on Markov chain theory (see, e.g., Anatolyev and Vasnev 2002). The main issue in this research direction is the estimation of the true dimension of the transition probability matrix, which, in turn, consists of estimating the relevant states and the order of the process. Even if these two estimates refer to different aspects of the process, they are not independent, and they have been extensively examined in the area of Information Theory to model alphabet processes (see, for example, Bühlmann and Wyner 1999; Bühlmann 2002).
In some real applications, the order k of a Markov process is specified in advance. According to the observed series, the value of k can be rather large. Moreover, the series might range within a wide set of values. All such occurrences may cause the number of rows in the transition probability matrix to be too large w.r.t. the information necessary to model the process (i.e., several rows are redundant). In this case one would need to reduce the number of rows, focusing in particular on those which do not "carry" enough information.
Generally, reducing the dimension of the transition probability matrix means suitably aggregating the states of the related Markov chain into "super-states", in order to obtain a reduced model that still represents the original process well. Therefore, model reduction in Markov chains basically consists of finding a suitable partition of the state space and the corresponding aggregation of the rows and/or the columns of the transition probability matrix.
Different reduction methodologies have been proposed in the literature to fulfil the different requirements imposed by the specific application under study.
The aggregation rules strongly depend on the order of the considered Markov chain. In the classical framework of Markov chains of order 1, the aggregation should be performed to obtain states of a new Markov chain of order 1. For example, Zhu et al. (2002), who suggested an efficient link prediction procedure for the design of adaptive web sites, based their reduction of the transition probability matrix on a "similarity" criterion and applied an algorithm devised by Spears (1998).
In the literature, a well-known Markov chain reduction approach is lumping, in which the states of a Markov chain must be aggregated while guaranteeing that the transition probabilities of the reduced Markov chain satisfy the Chapman-Kolmogorov equations. A lumping approach is followed, for example, by Deng et al. (2011), who studied the problem of reducing a Markov chain of order 1 by minimizing the Kullback-Leibler divergence rate between the original and the reduced transition probability matrix. The Kullback-Leibler divergence rate, which measures the loss of information when replacing the original matrix with the reduced matrix (see Rached et al. 2004), is the objective function of their optimization model. Since it is a nonlinear function, their problem is hard to solve, so they applied a heuristic solution procedure. For lumping, we refer the reader to the introductory paper of Burke and Rosenblatt (1958) and also to Kemeny and Snell (1976), Barr and Thomas (1977), Abdel-Moneim and Leysieffer (1984), and White et al. (2000). More recently, Thomas (2010) provided an encyclopedic overview of lumping, along with the related main theoretical results.
A different aggregation approach for the Markov chain reduction problem relies on spectral graph theory. The basic idea is to search for an optimal partition of a graph into subgraphs by implementing a procedure grounded on the eigenvectors of a special pre-specified matrix. For excellent surveys and bibliographic reviews on spectral graph theory, see Chung (1997). In spectral theory, graphs and Markov chains can be linked in a natural way: indeed, a Markov chain can be obtained as a random walk on the vertices of a given graph, the corresponding transition probabilities being the normalized weights assigned to the edges. The graph's partition is then translated into a partition of the state space of the Markov chain. Spectral Markov theory generalizes the lumping framework and has been the scientific ground of some important studies. Verma and Meila (2003) compared different three-stage clustering algorithms based on matrix eigenvectors. In particular, they introduced a preprocessing phase in which the matrix of interest M was row-normalized, moving de facto from graphs to a Markov chain whose transition probability matrix is the normalized M. Meila and Xu (2004) discussed the explicit relationships between graphs and Markov chains and introduced a new spectral clustering-based optimization problem, which is NP-hard.
When dealing with an order k > 1, the constitutive properties of Markov chains of order 1 no longer apply. In particular, the rows of the transition probability matrix are null when the related k-state occurs with zero probability; moreover, the Chapman-Kolmogorov Theorem cannot be properly stated. This shows that the lumping framework does not fit every Markov chain reduction problem.
When the model reduction in Markov chains is formulated as a partitioning problem, one has to take into account that, in general, the computational complexity of the problem requires the use of heuristic procedures. For example, a heuristic approach was adopted by Cerqueti et al. (2013), who proposed a Tabu Search procedure for the problem of simultaneously partitioning the state space of a continuous-valued process and identifying the order of the process. Their approach starts by modeling a continuous-valued process as "Markov switching regimes" (see, e.g., Hamilton 1996; Jeanne and Masson 2000; Hamilton 2005) and identifying the "relevant" states defined by the thresholds that mark the passage from one regime to another (see also Cerqueti et al. 2010). Their idea was to partition the set of rows of the transition probability matrix (associated with the discretized version of the process) to reduce its size while minimizing the information loss that may derive from matrix compression. Rows belonging to the same class of the partition are defined "similar" according to a suitable distance indicator. The computational effort for solving such a problem increases exponentially w.r.t. the order of the Markov chain, and follows the law of the Bell numbers w.r.t. the number of states included in the initial discretization. Even if the authors provided a pre-processing that drastically reduces the size of the solution space, the partitioning problem remains intractable and a heuristic procedure for its solution is required.
In this paper we are concerned with Markov chains of order k ≥ 1, focusing mainly on the case k > 1. Aggregation involves only the rows of the transition probability matrix, i.e., the k-states of the process, while the columns of the matrix remain associated with the original (single) states. Therefore, in general, we have a Markov chain defined by a rectangular transition probability matrix, and the aggregated matrix is rectangular as well. This is a main distinctive feature of our model that requires an approach different from lumping, in which the transition probability matrix of the given Markov chain is required to be square, and both its rows and columns must be aggregated, producing a reduced Markov chain with a square transition probability matrix (whose size is reduced w.r.t. the size of the original matrix). Hence, the grounding assumption on the order k of the Markov chain prevents us from using lumping techniques and requires an ad hoc aggregation procedure.
We assume k fixed and focus on the problem of partitioning the set of rows of the transition probability matrix optimally, by suggesting an exact approach through a mixed integer linear program (MILP). The objective function is based on a dissimilarity measure between pairs of rows, and a special constraint is introduced to guarantee that the bootstrapped series are sufficiently diversified. The constraint is introduced to fulfil the basic requirement of diversification of outputs when applying bootstrapping methods. In our application, relying on the partition provided by our MILP, we were able to generate bootstrapped series that maintained the statistical properties of the original sample without being its exact replication.
The proposed MILP can be solved via common optimization packages, such as the recent GUROBI optimization solver. This allows us to efficiently solve problems where the number of rows of the matrix ranges from 40 to 60, which is the typical size for certain economic and financial phenomena. As a real application of our method, we considered the problem of bootstrapping the series of the Spanish and German electricity prices observed daily for 6 and 7 consecutive years, respectively. We point out that such time series are considered "hard to replicate" cases in the literature, because of their nonlinear dependence structure.
This paper contributes to the literature on Markov chain bootstrapping by providing a nonparametric, model-free approach, which is particularly suitable for time series with nonlinear data dependence, such as those related to evolutive phenomena in electricity markets. Our work is focused on the original purpose of bootstrapping, i.e., to improve the parameter estimation obtained from a sample of observed data. We point out that bootstrapping consists of resampling the original data (with replacement), and, therefore, it cannot be immediately used to simulate out-of-sample trajectories, as generally happens with other simulation methods (e.g., Monte Carlo, neural networks, etc.). This kind of analysis is out of the scope of our application.
To conclude, we point out that the literature contains several contributions dealing with mixed integer linear programming for optimal (constrained) partitioning problems of different nature (see, e.g., Mueller and Kramer 2010; Saǧlam et al. 2006, and the references therein). It must be observed that, although for this paper we adopted standard modeling techniques (such as, for example, the one used to linearize the objective function), our MILP provides a new contribution to Markov chain model reduction in several respects. It is specifically designed for the real-life issues related to Markov chain bootstrapping. In particular, the objective function can be viewed as an innovative way of measuring information loss when the Markov chain reduction problem consists of aggregating only the rows of the transition probability matrix. In addition, our model provides tools to control the diversification of the bootstrapped series, and proposes a bi-objective approach able to find aggregations of rows with a good balance between the number of "super-states" and the dissimilarity between rows aggregated in the same super-state.
The paper is organized as follows. In Sect. 2, we discuss Markov chain bootstrapping in detail and describe the mixed integer linear program proposed for partitioning the rows of a transition probability matrix. Section 3 is dedicated to the analysis of two real-life problems: after describing the data sets, we illustrate our results for the reduction of the transition probability matrix and then provide a comparative statistical analysis of the bootstrapped time series with respect to the observed ones. In Sect. 4, we collect some final considerations and concluding remarks.

Markov chain bootstrapping based on mixed integer linear programming
In this section we introduce the Markov chain bootstrapping problem and propose an optimization approach to solve it. First, we introduce the basic idea underlying the problem, together with the corresponding notation and definitions. Consider a time-varying phenomenon. Under the hypothesis that the phenomenon evolves according to a k-th order Markov chain (X(t), t ≥ 0), with k ≥ 1, our aim is to resample it through a bootstrapping method. The estimation of the transition probability matrix associated with (X(t), t ≥ 0) relies on an available time-ordered sample of observations of the investigated phenomenon.
In bootstrapping procedures, two somewhat conflicting aims are pursued: on the one hand, the exact replication of the sample at each simulation should be avoided (diversification or multiplicity criterion); on the other hand, the statistical properties of the original sample should be reproduced in the replications as much as possible (similarity criterion). In order to generate a bootstrapped series satisfying the two above properties, we propose a method based on the aggregation of the rows of the transition probability matrix of (X(t), t ≥ 0). The problem is formulated as a partitioning problem of the set of rows of the transition probability matrix, and it is modeled as a mixed integer linear program in order to solve it exactly.

Notation and definitions
Assume that the possible realizations of an evolutive phenomenon vary in an interval [α, β] ⊆ R. Consider the following series of time-ordered realizations of the phenomenon observed over the time horizon (1, ..., t, ..., T):

e(T) = (e_1, ..., e_t, ..., e_T). (1)

We also refer to T as the length of the observed series. Consider a partition of [α, β] into n intervals with endpoints b_0 < b_1 < ... < b_n, such that b_0 = α and b_n = β. In order to simplify the notation, we refer to the z-th interval [b_{z-1}, b_z) as α_z. This is only a matter of notation, and α_z is not meant to be a single point replacing the whole interval [b_{z-1}, b_z). By definition, the intervals α_1, ..., α_n are pairwise disjoint and their union is [α, β]. We collect the elements of such a partition in a set A = {α_1, ..., α_z, ..., α_n}, so that A represents the set of all possible states for the time-ordered series under analysis. States in the set A are called theoretical. We will refer to the above partition of [α, β] as the initial partition. A state corresponds to one of the intervals α_z, z = 1, ..., n, which are then called the initial intervals or initial states. The term "state" is chosen to refer explicitly to the Markov chain framework described later in this section. Based on the above discretization, for each observed value e_t (point) there is a unique α_z (interval) in A such that e_t ∈ α_z. As a consequence, a time series such as formula (1) will be represented as a sequence of states of a Markov chain a(T) = (a_1, ..., a_t, ..., a_T), where a_t = α_z whenever e_t ∈ α_z, for z = 1, ..., n.
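The discretization step above can be sketched in a few lines of Python (an illustrative helper with our own names, not code from the paper); it maps each observation e_t to the index z of the initial interval containing it:

```python
import bisect

def discretize(series, breakpoints):
    """Map each observation e_t to the index z of the initial interval
    [b_{z-1}, b_z) that contains it (illustrative helper, not the paper's code).

    `breakpoints` holds the interior endpoints b_1, ..., b_{n-1}: values
    below b_1 fall in state 1, values at or above b_{n-1} fall in state n.
    """
    return [bisect.bisect_right(breakpoints, e) + 1 for e in series]

# three initial intervals: [b_0, 2.0), [2.0, 4.0), [4.0, b_n]
states = discretize([1.2, 3.7, 2.1, 5.0], breakpoints=[2.0, 4.0])
```

With the two breakpoints above, the four observations are mapped to the state sequence 1, 2, 2, 3, which plays the role of a(T) in the text.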
Given the set A of initial states, for a fixed k the cartesian product A^k collects the k-tuples formed by states in A (theoretical k-states). We denote a generic theoretical k-state by α_h(k), with h = 1, ..., n^k. We denote a sequence of k < T consecutive elements extracted from e(T) by:

e(t, k) = (e_t, ..., e_{t+j}, ..., e_{t+k-1}) for some t ∈ {1, ..., T − k + 1}. (2)

Notice that in formula (2) e(t, k) depends on two parameters, namely the length k of the above sequence and t, which identifies the starting time of the sequence. An observed k-state is, analogously, a sequence a(t, k) = (a_t, ..., a_{t+k-1}) of k consecutive states extracted from a(T). Assume that the evolutive phenomenon is modeled as a Markov chain of order k, (X(t), t ≥ 0), with state space A. Then, the k-th order transition probability matrix is estimated by using the definition of Ching et al. (2008), which involves the estimation of the empirical frequencies.
We introduce the transition probability for the process to reach state α_z immediately after the sequence of states of α_h(k), that is:

μ_hz = P(X(t + k) = α_z | (X(t), ..., X(t + k − 1)) = α_h(k)). (3)

Let us denote by M the k-th order transition probability matrix, with |A^k| = n^k rows and |A| = n columns. To avoid a cumbersome notation, we set n^k = m. Matrix M is built starting from the observed sample e(T), and its generic entry (h, z) contains the value μ_hz. According to the approach in Ching et al. (2008), we estimate μ_hz by

μ̂_hz = F_hz / Σ_{z'=1}^{n} F_hz', (4)

where F_hz is the number of times that the sequence (α_h(k), α_z) has been observed in the sample. For a given k, each row of the transition probability matrix M contains the transition probabilities from a k-state α_h(k) ∈ A^k to the states α_z ∈ A. Formula (4) is used to compute M on the basis of the original sample. Notice that, when a k-state is observed only once in the original series e(T), the estimate of its transition probabilities is equal to 1 for the unique state to which the corresponding trajectory evolved, while it is equal to 0 for the other states. We refer to such k-states as deterministic k-states, and the corresponding rows of the transition probability matrix are called deterministic rows. Each row that is not completely filled with zeros reports the estimated transition probabilities from an observed k-state to the observed states, which substantially means that only sequences α_h(k) such that α_h(k) ≡ a(t, k) for some t ∈ {1, ..., T − k + 1} and states α_z ≡ a_t for some t ∈ {1, ..., T } are considered.
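The frequency-based estimation of formula (4) can be sketched as follows (an illustrative implementation with our own function and variable names). Each observed k-state gets a row holding the empirical probabilities of the states that followed it:

```python
from collections import Counter, defaultdict

def estimate_transition_matrix(states, k):
    """Estimate k-th order transition probabilities from a state sequence,
    in the spirit of the empirical-frequency formula (4) (sketch only).

    Returns a dict mapping each observed k-state (a k-tuple) to a dict
    {next_state: probability}; unobserved k-states simply have no row.
    """
    counts = defaultdict(Counter)
    for t in range(len(states) - k):
        k_state = tuple(states[t:t + k])
        counts[k_state][states[t + k]] += 1   # F_hz
    matrix = {}
    for k_state, nxt in counts.items():
        total = sum(nxt.values())             # sum of F_hz' over z'
        matrix[k_state] = {z: c / total for z, c in nxt.items()}
    return matrix

M = estimate_transition_matrix([1, 2, 1, 2, 3, 1, 2, 1], k=2)
# the 2-state (1, 2) was followed by 1 twice and by 3 once;
# (2, 1) is a deterministic row: it was always followed by 2
```

Rows such as the one for (2, 1) above, built from a single observation, are exactly the deterministic rows mentioned in the text.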
To introduce our MILP, in what follows we refer interchangeably to the k-states α_u(k) and α_v(k) of A^k and to the corresponding rows, u and v respectively, of matrix M. Similarly, we refer interchangeably to the set A^k and to the corresponding set V of rows of matrix M.
The dissimilarity between two rows u and v of M can be measured as follows:

d_uv = Σ_{z=1}^{n} |μ_uz − μ_vz|.

It is, in fact, a distance indicator between the rows of matrix M which takes values in the interval [0, 2] (see Cerqueti et al. 2010). As we will argue below, the dissimilarity d_uv can be viewed as a proxy for the "cost" of putting the k-states α_u(k) and α_v(k) in the same class of a partition, and, indeed, the objective function of our optimization model is based on it.
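Assuming the indicator is the L1 distance between the two probability rows (consistent with the stated range [0, 2], which is attained when the rows have disjoint support), a minimal sketch is:

```python
def dissimilarity(row_u, row_v, states):
    """L1 distance between two rows of the transition matrix (a sketch,
    assuming this is the indicator with range [0, 2] cited in the text).
    Rows are dicts {state: probability}; absent states count as 0."""
    return sum(abs(row_u.get(z, 0.0) - row_v.get(z, 0.0)) for z in states)

u = {1: 0.5, 2: 0.5}
v = {1: 0.5, 3: 0.5}
d = dissimilarity(u, v, states=[1, 2, 3])  # |0| + |0.5| + |-0.5| = 1.0
```

Two identical rows give d = 0; two rows concentrated on disjoint sets of states give the maximum value 2.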

An optimization approach to the aggregation of the rows of a transition probability matrix
Let us introduce our partitioning problem of the set V of rows of the transition probability matrix M. Consider a partition of V into q classes, π = {C_1, ..., C_p, ..., C_q}, with 1 ≤ q ≤ m, and let Π be the set of all possible such partitions.
The diameter of a class C_p, denoted by δ(C_p), is defined as the maximum dissimilarity between two elements in C_p:

δ(C_p) = max_{u,v ∈ C_p} d_uv.

The diameter of a partition π ∈ Π is defined as

Δ(π) = max_{p=1,...,q} δ(C_p). (8)

Remark 1 According to formula (8), and since d_uv ∈ [0, 2], one has 0 ≤ Δ(π) ≤ 2 for each partition π ∈ Π. In the special case of the partition in which each class includes only one element, which we call the singleton partition, one has Δ(π) = 0. Notice that, given π, π′ in Π, π is a refinement of π′ when π has the same classes of π′ with the exception of some classes which in π are further partitioned. When π is a refinement of π′, one has Δ(π) ≤ Δ(π′). The least refined partition is the one composed of only one class grouping all the elements of V, and corresponds to the partition with the maximum diameter, call it Δ_max, among the partitions π ∈ Π. Finally, notice that Δ_max ≤ 2 and, depending on the data set under study, it may or may not reach this upper bound.
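The two diameters can be computed directly from a table of pairwise dissimilarities; the following sketch (with our own names and toy values) illustrates formula (8):

```python
from itertools import combinations

def class_diameter(cls, d):
    """δ(C_p): maximum pairwise dissimilarity within a class
    (0 for singleton classes)."""
    return max((d[u][v] for u, v in combinations(cls, 2)), default=0.0)

def partition_diameter(partition, d):
    """Δ(π): the largest class diameter over the partition, as in formula (8)."""
    return max(class_diameter(cls, d) for cls in partition)

# toy symmetric dissimilarity table over three rows (illustrative values)
d = {"a": {"b": 0.4, "c": 1.2},
     "b": {"a": 0.4, "c": 0.9},
     "c": {"a": 1.2, "b": 0.9}}
delta = partition_diameter([["a", "b"], ["c"]], d)  # max(0.4, 0) = 0.4
```

Merging "c" into the first class would raise the diameter to 1.2, which is the refinement behaviour described in Remark 1.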
For a given π ∈ Π, we denote by C_p̄ a class attaining the partition diameter, that is, a class for which δ(C_p̄) = Δ(π). In our partitioning problem we aim to minimize both the number of classes and the diameter of the partition. These two objectives are obviously in conflict, since the diameter of a partition tends to increase when the number of classes decreases, and vice versa. In fact, we have a bi-objective problem which we handle by minimizing the diameter of the partition while controlling for the number of its classes. For this purpose, we introduce in the model a parameter γ ≥ 0 that limits the value of the diameter Δ(π) from below, implying a constraint on the cardinality of the partition. By means of γ we implement a control in the bootstrapping procedure to meet the multiplicity criterion, i.e., to avoid the exact replication of the original sample. For a fixed value γ ≥ 0 and an integer 1 ≤ q ≤ m, we formulate the following partitioning problem: among the partitions π ∈ Π with at most q classes and such that δ(C_p̄) is at least equal to γ, find a partition π* that minimizes Δ(π).
The problem can be formulated as a mixed integer linear program, as illustrated in the following.
To formalize the optimization model, let us observe that a partition π = {C_1, ..., C_p, ..., C_q}, with 1 ≤ q ≤ m, can be equivalently written as π = {C_1, ..., C_p, ..., C_m}, with m − q of the classes empty. Let v ∈ V be a generic row of M. We introduce the binary variables x_vp and y_p, with p = 1, ..., m, such that x_vp = 1 if row v is assigned to class C_p and 0 otherwise, and y_p = 1 if class C_p is active (i.e., non-empty) and 0 otherwise. It is easy to see that the product x_up · x_vp is equal to 1 if and only if the rows u and v belong to the same class C_p of the partition. Hence, the cost of assigning both u and v to class C_p is given by d_uv · x_up · x_vp. With this notation, the diameter of class C_p can be rewritten as δ(C_p) = max_{u,v ∈ V} d_uv · x_up · x_vp. The problem of finding the partition of the set of rows V into at most q classes that minimizes the maximum diameter of a class can be formulated as follows:

min max_{u,v ∈ V; p=1,...,m} d_uv · x_up · x_vp
s.t. (1) Σ_{p=1}^{m} x_vp = 1, for all v ∈ V,
(2) x_vp ≤ y_p, for all v ∈ V and p = 1, ..., m,
(3) Σ_{p=1}^{m} y_p ≤ q,
x_vp, y_p ∈ {0, 1}. (10)

Notice that in the above formulation we bound the number of classes of the partition by q. Thus, varying q in {1, ..., m} allows us to tackle the bi-objective problem as a sequence of m single-objective ones. In problem (10) the first set of constraints states that each row must belong to a single class of the partition. According to constraints (2), a row can belong to a class C_p only if C_p is active. Constraint (3) provides an upper bound on the number of classes of π. Problem (10) is an integer nonlinear program which can be linearized by introducing additional association variables w_uvp, with 0 ≤ w_uvp ≤ 1, and the following set of constraints:

w_uvp ≥ x_up + x_vp − 1, w_uvp ≤ x_up, w_uvp ≤ x_vp, for all u, v ∈ V and p = 1, ..., m. (11)

Thus, if both u and v belong to class C_p, one has x_up = x_vp = 1, and, therefore, w_uvp = 1.
On the other hand, if either x_up = 0 or x_vp = 0, one has w_uvp = 0, too. This also implies that the bounds w_uvp ≤ 1 are always satisfied, and, hence, they need not be included explicitly in the problem formulation. Notice that constraints (11) guarantee that the variables w_uvp assume only the values 0 or 1, and, therefore, they can be introduced in the model as real variables.
The objective function can be written in terms of the new variables as follows:

min max_{u,v ∈ V; p=1,...,m} d_uv · w_uvp.

Then, the objective function can be linearized by introducing a new variable d that replaces max_{u,v ∈ V} {d_uv · w_uvp} and adding the following set of constraints:

d ≥ d_uv · w_uvp, for all u, v ∈ V and p = 1, ..., m.

Let us notice that the above model does not prevent the optimal objective function value from becoming very small (it may even turn out to be 0). We recall that the core of the bootstrapping method is to generate resamplings of the observed series e(T) which should be sufficiently "similar" to e(T), in order to guarantee that they can be seen as different realizations of the same phenomenon that generated e(T). However, they also have to be sufficiently "diversified" from e(T), in order to have some degree of "variability" among them. The optimal partition applied to M should then be structured so as to guarantee such requirements as much as possible. If, on the one hand, similarity is pursued in the model via the minimization of the diameter, on the other hand, diversification can be controlled by imposing that the diameter be not smaller than a prefixed (positive) threshold γ. This additional constraint prevents us from choosing the trivial singleton partition, where each class is formed by a single element: without it (i.e., with γ = 0), the singleton partition would be optimal, since its objective function value is zero, yet no aggregation of the rows of M would actually be performed. In order to formalize the additional condition related to the threshold γ, we introduce the binary variables t^γ_uvp, which are meant to equal 1 only if rows u and v belong to the same class C_p and d_uv ≥ γ. Adding further suitable constraints involving t^γ_uvp and γ, we obtain the MILP representing our specific partitioning problem:

min d
s.t. constraints (1)-(3) of problem (10) and constraints (11),
d ≥ d_uv · w_uvp, for all u, v ∈ V and p = 1, ..., m,
(8) γ · t^γ_uvp ≤ d_uv · w_uvp, for all u, v ∈ V and p = 1, ..., m,
(9) Σ_{u,v ∈ V} Σ_{p=1}^{m} t^γ_uvp ≥ 1,
x_vp, y_p, t^γ_uvp ∈ {0, 1}, 0 ≤ w_uvp ≤ 1. (12)

In the above model, constraint (8) describes the relation between the variables w and the new variables t^γ. Constraint (9) forces at least one variable t^γ to be equal to 1, thus guaranteeing that the diameter of the optimal partition is at least γ. Cerqueti et al.
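On tiny instances, the partitioning problem solved by model (12) can be checked against brute-force enumeration of all set partitions; the following reference sketch (our own code, not the paper's solver) enumerates the Bell-number-many partitions, which is feasible only for very small m:

```python
from itertools import combinations

def all_partitions(items):
    # all set partitions of a list (Bell-number many: only for tiny instances)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in all_partitions(rest):
        for i in range(len(smaller)):
            yield smaller[:i] + [[first] + smaller[i]] + smaller[i + 1:]
        yield smaller + [[first]]

def diameter(partition, d):
    return max((d[u][v] for cls in partition for u, v in combinations(cls, 2)),
               default=0.0)

def best_partition(rows, d, q, gamma):
    """Among partitions with at most q classes and diameter >= gamma,
    return one of minimum diameter (the problem solved exactly by the MILP)."""
    best, best_diam = None, float("inf")
    for p in all_partitions(rows):
        if len(p) > q:
            continue
        diam = diameter(p, d)
        if gamma <= diam < best_diam:
            best, best_diam = p, diam
    return best, best_diam

d = {"a": {"b": 0.4, "c": 1.2},
     "b": {"a": 0.4, "c": 0.9},
     "c": {"a": 1.2, "b": 0.9}}
part, diam = best_partition(["a", "b", "c"], d, q=2, gamma=0.1)
# with q = 2 and gamma = 0.1 the best partition is {{a, b}, {c}}, diameter 0.4
```

For the realistic instances of Sect. 3 (40-60 rows) this enumeration is hopeless, which is precisely why the MILP formulation matters.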
(2010) discuss the coherence of some distance indicators with the information-type criteria advanced by Kolmogorov (1965). According to Remark 1, the dissimilarity measure used in the objective function of model (12) respects the conditions of Kolmogorov (1965). Therefore, the optimal solution π* of model (12) ensures that the "loss of information" (about the evolution properties of the process) is minimized for any given level γ fixed to meet the multiplicity criterion. In other words, the information efficiency of π* implies that, although generated through a coarser structure of the information in the aggregated transition probability matrix M*, a bootstrapped series simultaneously maintains a good statistical similarity w.r.t. the original sample and a satisfactory diversification from the other bootstrapped series (multiplicity). The bootstrapping method adopted in this paper was advanced in Cerqueti et al. (2013), to which we refer the interested reader for further details.
Here the procedure takes as inputs the optimal partition π* of model (12), the compressed transition probability matrix M* associated with π*, as well as the observed time series e(T) = (e_1, ..., e_t, ..., e_T).
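A minimal sketch of this resampling step, assuming a dictionary-based transition matrix and ignoring the details of the compression (the full procedure is in Cerqueti et al. 2013), could look as follows:

```python
import random

def bootstrap_states(M, initial_k_state, length, seed=0):
    """Resample a state sequence of the given length from a k-th order
    transition matrix M (dict: k-tuple -> {next_state: probability}).
    An illustrative sketch of Markov chain bootstrapping only; the actual
    procedure operates on the compressed matrix M* and the classes of the
    optimal partition pi*."""
    rng = random.Random(seed)
    seq = list(initial_k_state)
    k = len(initial_k_state)
    while len(seq) < length:
        # fall back to the initial k-state's row for unseen k-states
        row = M.get(tuple(seq[-k:]), M[initial_k_state])
        states, probs = zip(*row.items())
        seq.append(rng.choices(states, weights=probs)[0])
    return seq

M = {(1, 2): {1: 0.5, 3: 0.5}, (2, 1): {2: 1.0},
     (2, 3): {1: 1.0}, (3, 1): {2: 1.0}}
series = bootstrap_states(M, (1, 2), length=12)
```

Repeating the call with different seeds yields different resampled sequences, which is exactly the multiplicity the γ-constraint is meant to protect.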
Besides the above theoretical arguments, the good statistical properties shown by our bootstrapped series will be analyzed in detail in Sect. 3.4, where the values of several statistics computed on the bootstrapped series will be compared with the corresponding values calculated on the original sample.

Application and results
In this section we provide a case study application of the bootstrapping method we propose. It is based on data taken from the Spanish and German electricity markets, as described in detail below. The analyzed series show several interesting features which make them difficult to treat with a bootstrapping method. Therefore, they represent challenging tests for evaluating the performance of our approach.

Data
In our application we study two time series, namely the "Mibel Spanish Electric System Arithmetic Average Price" and the German "EEX Phelix Day Base Price".The series have been observed daily in the period from January 1st, 1998 to December 31st, 2003 (Spain) and from June 17th, 2000 to May 8th, 2007 (Germany).The prices are expressed in euros and refer to 1 MWh.The Spanish time series consists of T = 2190 observations, while for the German series we have T = 2517.
Figures 1 and 2 show the two time series, which are characterized by a weekly and an annual seasonality, and by occasional spikes; spikes usually correspond to unexpected shortages on the supply side of the electricity system, or to unexpected and temporary increases of the demand (e.g., sudden meteorological events driving consumption up). Since we deal with daily data, intra-day seasonality is neglected in this analysis. Because of the joint presence of such features, electricity price series are usually considered "hard to model" cases in the literature. For a review of the difficulties in modeling electricity prices and of the methods developed to address them, see, for example, Bunn (2004), Huisman and Mahieu (2003), Weron et al. (2004), and Weron (2006). Raw price data have had an (exponential) weekly seasonal component removed, as well as an (exponential) trend. Two main reasons justify this pretreatment. Removing the weekly seasonality leaves us free to reduce the order of the Markov chain below 7, which corresponds to a great reduction in the complexity of our problem. The removal of the trend component makes the series more stationary. Of course, both components are added back to the bootstrapped series. The estimation of exponential (rather than linear) components is recommended to avoid that the removal/reintroduction process generate occasional negative prices. In Appendix 1 we give the details of this data treatment.
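The trend removal/reintroduction described above can be sketched as a log-linear least-squares fit (an illustrative version with our own names; the paper's exact treatment, including the weekly component, is in its Appendix 1). Working multiplicatively guarantees that reintroducing the trend can never produce negative prices:

```python
import math

def remove_exp_trend(prices):
    """Fit log p_t = a + b*t by least squares and divide the series by
    exp(a + b*t); returns the detrended series and (a, b) so that the
    trend can be added back to the bootstrapped series (sketch only)."""
    n = len(prices)
    t_mean = (n - 1) / 2
    y = [math.log(p) for p in prices]
    y_mean = sum(y) / n
    b = (sum((t - t_mean) * (yt - y_mean) for t, yt in enumerate(y))
         / sum((t - t_mean) ** 2 for t in range(n)))
    a = y_mean - b * t_mean
    return [p / math.exp(a + b * t) for t, p in enumerate(prices)], (a, b)

def add_back_trend(series, ab):
    a, b = ab
    return [v * math.exp(a + b * t) for t, v in enumerate(series)]

prices = [2.0 * math.exp(0.01 * t) for t in range(50)]   # pure exponential trend
detrended, ab = remove_exp_trend(prices)                  # flat series of ones
restored = add_back_trend(detrended, ab)                  # recovers the prices
```

On this synthetic series the fit is exact; on real prices the residual detrended series is what gets discretized and bootstrapped.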

Preliminary segmentation of the support
As already explained in Sect. 2.1, in order to discretize our continuous-valued process, a preliminary segmentation of the support [α, β] is performed by partitioning it into n initial intervals α_1, ..., α_z, ..., α_n. We notice that, if there is econometric evidence or there are theoretical reasons to believe that the regimes of the phenomenon are few, increasing the number n of initial intervals will help find a segmentation whose classes approximate the regimes more precisely. Unfortunately, fixing large values for n has two drawbacks. First, it increases the number of rows of the transition probability matrix M, thus affecting the compression procedure on M from a complexity point of view. Second, it reduces the expected frequency of observations in each initial interval, thus affecting the statistical significance of the estimates of the transition probabilities. Therefore, the value of n should be set so as to balance such limitations. In our case study, we finally set n = 12, which is nevertheless an order of magnitude larger than the number of states commonly considered in the literature for the regimes of electricity prices (usually 2 or 3, see, e.g., Huisman and Mahieu 2003).
The segmentation of the support into 12 initial states was performed through the minimum-variance clustering procedure proposed in Ward (1963), after having removed trend and weekly seasonality from the original series (see Appendix 1 for further details). Figures 3 and 4 show the series of Spain and Germany, together with the initial 12 intervals α_1, ..., α_12 (separated by horizontal lines). Appendix 2 details this preliminary segmentation.
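The preliminary segmentation can be sketched as a one-dimensional minimum-variance clustering in the spirit of Ward (1963). The sketch below is our own simplification, not the authors' implementation: it restricts merges to adjacent clusters on the real line (so that every class remains an interval of the support), and the function name `ward_segments` is ours.

```python
import numpy as np

def ward_segments(values, n_classes):
    """Partition the support of a 1-D sample into contiguous intervals by
    agglomerative merging under Ward's minimum-variance criterion.
    Only adjacent clusters are merged, so each class is an interval."""
    x = np.sort(np.asarray(values, dtype=float))
    clusters = [[v] for v in x]          # start from singletons
    while len(clusters) > n_classes:
        # Ward's merging cost for each pair of adjacent clusters
        costs = []
        for a, b in zip(clusters[:-1], clusters[1:]):
            na, nb = len(a), len(b)
            costs.append(na * nb / (na + nb) * (np.mean(a) - np.mean(b)) ** 2)
        i = int(np.argmin(costs))        # merge the cheapest adjacent pair
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    # interval boundaries: midpoints between extremes of adjacent clusters
    return [(a[-1] + b[0]) / 2 for a, b in zip(clusters[:-1], clusters[1:])]
```

With n_classes = 12 this would return the 11 cut points separating the intervals α_1, ..., α_12 drawn as horizontal lines in Figs. 3 and 4.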
Transition probability matrices M_S and M_G of order k = 2, estimated for the Spanish (S) and German (G) markets, respectively, were computed using formula (4).
The set of observed 2-states, O_2, comprises 2188 elements for the Spanish instance and 2515 elements for the German one. The percentage of deterministic 2-states over the observed ones is about 1 % in both cases (23 for Spain and 21 for Germany). The remaining observed 2-states are probabilistic (2165 for Spain and 2494 for Germany). In partitioning the rows of the transition probability matrices M_S and M_G, we actually considered only the probabilistic 2-states. On the other hand, we defined a priori a single class collecting the non-observed 2-states, while we set up as many classes as the number of observed deterministic 2-states. For n = 12 we obtained 63 and 44 probabilistic rows in the transition probability matrices of Spain and Germany, respectively, while the deterministic rows are 19 in both cases.
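Formula (4) is not reproduced in this excerpt; the sketch below assumes it is the usual empirical relative-frequency estimator for a k-th order chain, with rows indexed by observed k-tuples of states. The function names and the dictionary representation are our illustrative choices.

```python
from collections import Counter, defaultdict

def estimate_transitions(states, k=2):
    """Empirical transition probabilities of a k-th order Markov chain.
    Each row is an observed k-tuple; entries are the relative frequencies
    of the next state among all occurrences of that k-tuple."""
    counts = defaultdict(Counter)
    for t in range(len(states) - k):
        counts[tuple(states[t:t + k])][states[t + k]] += 1
    matrix = {}
    for row, c in counts.items():
        total = sum(c.values())
        matrix[row] = {s: m / total for s, m in c.items()}
    return matrix

def is_deterministic(row_probs):
    """A 2-state is deterministic when all its mass sits on one successor."""
    return max(row_probs.values()) == 1.0
```

Splitting the observed 2-states into deterministic and probabilistic rows, as done above for M_S and M_G, then amounts to filtering the rows of `matrix` with `is_deterministic`.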

Optimization phase
The central element of model (12) is the parameter γ. This parameter provides a lower bound on the objective function value. Fixing γ to a positive value means imposing that at least one class of the partition is not formed by a single row. On the contrary, when γ = 0, no bound is imposed, so the optimal solution always corresponds to the trivial singleton partition, provided that the maximum number of classes q in constraint (3) of model (12) is set large enough (it suffices to set q = m). Since the objective function value of the singleton partition equals 0, that partition is, obviously, the one with maximum similarity within its classes and, therefore, it can be taken as a benchmark for the evaluation of the final partition chosen to perform the bootstrapping method.
In our experimental analysis we tested a set of values for γ in model (12). According to Remark 1, the set of possible values for this parameter is a subset of the interval [0, 2]. Different values within this range lead model (12) towards different optimal solutions, since constraints (8) and (9) become stricter as the value of γ increases. We notice that, for a given γ, the same optimal value of model (12) could be obtained with different values of q in constraint (3). In this case, for a given value of γ, we always prefer the minimum number of classes which enables model (12) to satisfy constraints (8) and (9). For each fixed value of γ, this provides a preferred solution characterized by two attributes: (i) the optimal value of the objective function of model (12), which we denote by d*(γ); (ii) the corresponding preferred value of q associated to such optimum, denoted by q*(γ). It is clear that, a priori, we have no information about which, among the infinitely many values in [0, 2], is the best choice for our aggregation purposes; we know that the extreme values of the interval are not good choices, but we need to test different values of γ and evaluate the pair of attributes (q*(γ), d*(γ)) of the corresponding preferred solution within that interval. Indeed, different pairs of attributes correspond to different compromises between the number of classes of the final partition and the value of its diameter. In our analysis, we produce different compromises and then compare them in order to choose the one that best fits the similarity and multiplicity purposes of the bootstrapping method. For this purpose, we tested a finite number of values of γ, equally spaced in the interval [0, 2]. This is a standard practice for parameter tuning when the only available information about a parameter is the range of its possible values and there is no reason to prefer any of those values a priori.
For each tested value of γ, we applied our MILP with increasing values of the integer parameter q, from 2 to m − 1. For any fixed γ, as q increases, we observed convergence towards a stable optimal value of the objective function: the optimal value d*(γ) is reached at q*(γ) and remains unchanged for q > q*(γ), so that there is no need to choose a value q > q*(γ). In addition, since the model minimizes the maximum diameter d of a class of the partition, and γ is a lower bound for d, the ideal case is attaining exactly this bound. In Fig. 5 we plot the solutions obtained by applying model (12), with different values of the parameter γ, to a data set extracted from the German time series whose transition probability matrix has 25 rows (called German25). The example shows that lowering the value of γ allows the model to reach smaller values of the objective function, and that each fixed value γ = 0.25, 0.5, 0.75, 1, 1.25, 1.5 is actually attained by the objective function from the value q*(γ) on. In this example we observe an ideal behavior of the model, which, in all cases, manages to attain the minimum of the objective function.
In Fig. 5 the points (q*(γ), d*(γ)) correspond to the leftmost points of the flat part of each curve. Reducing q further, for the same γ, implies accepting higher values of d and possibly overlapping with the solutions corresponding to smaller values of γ. The plot in Fig. 5 also shows that good improvements in the objective function can be obtained (for successive values of γ) by increasing q up to q*(γ) = 7 for γ = 0.5. Going beyond this requires q* = 16 for γ = 0.25. We did not analyze values of γ smaller than 0.25, since this would have implied too many classes in the optimal partition and an excessively long computational time for solving model (12). In order to choose the final partition for the aggregation of the rows of the transition probability matrix, we defined an evaluation index, E_V(γ), that takes into account both the values d*(γ) and q*(γ). The index is defined for 2 ≤ q < m, and it positively evaluates those γ that produce a small value of d*(γ) combined with a value of q*(γ) "sufficiently smaller" than its maximum (q = m). Figure 6 plots this index for each value of γ for the data set German25. It shows a systematic improvement whenever γ is lowered, although this becomes less evident for the last two values. We also considered the computational time required by the solver to obtain (q*(γ), d*(γ)), and decided between the values γ = 0.5 and γ = 0.25 on the basis of both measures. Unfortunately, for γ = 0.25 the computational time was always too high to justify the small improvement in d* that it produced w.r.t. the case γ = 0.5. For example, for German25, obtaining d* takes 512 s with γ = 0.5 and 8650 s with γ = 0.25. A similar behavior was observed for the other data sets, thus leading to the final choice of γ = 0.5 and the exclusion of γ = 0.25 from the analysis (see Table 1 for the computational burden of the runs performed on our medium size German and Spanish data sets).
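The incremental search over q described above can be sketched as an outer tuning loop. Here `solve_milp(gamma, q)` is a hypothetical stand-in for one exact solve of model (12) via AMPL/GUROBI; only the loop structure reflects the text.

```python
def tune_gamma(solve_milp, gammas, m):
    """For each gamma, raise q from 2 until the optimal value d stops
    improving, and record the preferred pair (q*(gamma), d*(gamma)).
    `solve_milp(gamma, q)` is assumed to return the optimal objective
    value of model (12) for the given parameters."""
    preferred = {}
    for g in gammas:
        prev = None
        for q in range(2, m):
            d = solve_milp(g, q)
            if prev is not None and d == prev:
                # value unchanged: d*(gamma) was already reached at q - 1
                preferred[g] = (q - 1, d)
                break
            prev = d
        else:
            preferred[g] = (m - 1, prev)
    return preferred
```

For instance, a toy solver `lambda g, q: max(g, 2.0 / q)` mimics the behavior seen in Fig. 5: the objective decreases in q until it hits the lower bound γ, and the flat part of the curve begins at q*(γ).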
The above considerations suggest the choice of γ = 0.5 for the aggregation of the rows of the transition probability matrix to be used in the bootstrapping method. This choice is also supported by the statistical analysis that we performed on 5000 bootstrapped series to assess the performance of the method we advance here (this analysis is illustrated in detail in the following subsection). Indeed, the 5000 series obtained with γ = 0.5 reflect the statistical properties of the original sample in a rather satisfactory manner, and a very low probability of duplication over the 5000 tested cases is observed. Therefore, for the bootstrapping method, we selected the optimal partition corresponding to γ = 0.5 for each market, and we let M*_S and M*_G indicate the corresponding 2nd-order aggregated transition probability matrices.
The two singleton partitions obtained for γ = 0 and q = m are denoted by π_S and π_G, respectively. Partition π^milp_S consists of 72 classes (19 of which correspond to deterministic 2-states, and one contains the non-observed 2-states), while partition π^milp_G has 43 classes (19 of which correspond to deterministic 2-states, and one contains the non-observed 2-states). With reference to our medium size data sets, Table 1 reports the values q(γ) and d(γ) of the solutions found for model (12) for the different values γ = 0.5, 0.75, 1, 1.25, 1.5, together with the time spent to find such solutions. In these tests we set a limit of 20 h on the running time: in Table 1 we put a "-" when this limit was reached and, in these cases, we report the best solution found so far. In the other cases, q(γ) and d(γ) correspond to q*(γ) and d*(γ), respectively. It must be noticed that, in spite of the premature arrest of the procedure, in all cases in which the time limit was reached the value of the best solution found was equal to its lower bound γ. For the model solution, we used the popular AMPL software, calling the GUROBI v6.0.4 solver for the optimization. The experiments were carried out on a workstation equipped with an Intel Core i7-4810MQ processor with 2.8 GHz clock rate and 16 GB RAM. We used GUROBI's default values of 1e-10 and 1e-4 as the absolute and relative tolerances, respectively, on the gap between the best objective bound and the incumbent solution objective.

Statistical analysis
The theoretical justification of our experiments rests on the fact that any trajectory generated by a stochastic process is a random sample extracted from an infinite population. Starting from this point, the distribution of any statistic computed on the population (e.g., moments, minimum, maximum, etc.) can be used to identify an acceptance range within which the same statistic, computed on a single observation ω, is expected to fall with a given probability, under the null hypothesis that ω is extracted from that population. If a large set of statistics calculated on ω falls, on average, into the corresponding acceptance ranges, then the null hypothesis cannot be rejected.
For each market (Spain and Germany), we evaluate the performance of the bootstrapping method we advance in two different scenarios: a conservative scenario, based on the singleton partitions π_S and π_G, and a scenario based on the MILP partitions π^milp_S and π^milp_G (γ = 0.5), which are expected to generate higher diversification and lower similarity than those of the conservative scenario.
We generated 4 sets of 5000 bootstrapped series (one set for each partition), of length 2188 for the Spanish case and 2515 for the German one. Figures 7 and 8 give an idea of how the bootstrapped series look, respectively, for the Spanish market and for the German one, based on partitions π^milp_S and π^milp_G. Here, the bootstrapped series include the exponential trend and weekly seasonality initially removed from the original samples (see Appendix 1). Each value of the series is classified as deterministic (square mark) or probabilistic (diamond mark).
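The generation step of a Markov chain bootstrap of order 2 can be sketched as follows. This is an illustrative reconstruction, not the authors' exact procedure: `matrix` maps observed 2-tuples to dictionaries of next-state probabilities (the rows of the aggregated matrix), and the function name is ours.

```python
import random

def bootstrap_series(matrix, start, length, seed=0):
    """Markov chain bootstrap sketch: starting from an observed 2-state
    `start`, repeatedly sample the next value from the row of the
    (aggregated) transition matrix indexed by the last two values."""
    rng = random.Random(seed)
    path = list(start)
    while len(path) < length:
        row = matrix[tuple(path[-2:])]          # conditioning 2-state
        states, probs = zip(*sorted(row.items()))
        path.append(rng.choices(states, weights=probs)[0])
    return path
```

When the conditioning 2-state is deterministic, the sampled successor is forced (as happens around spikes); when it is probabilistic, the bootstrap diversifies across the observed successors.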
We can make the following considerations:
- both the bootstrapped series in Figs. 7 and 8 reproduce the spikes observed in the original series (see Figs. 1 and 2);
- the normal trading regime also appears satisfactorily reproduced; indeed, the two series take values in ranges strongly overlapping those of the original ones;
- weekly seasonality is clearly distinguishable, as well as a slightly positive trend;
- the frequency of deterministic values is 1 % for both the Spanish and the German case (which closely corresponds to the frequency of deterministic rows in the original transition probability matrices).

The bootstrapping method probabilistically reproduces sequences of the original series, and occasionally such segments are interleaved with deterministically chosen values. This highlights a key advantage of Markov chain bootstrapping: depending on the probability distribution associated with each conditioning event, the method switches between deterministic behavior (e.g., in the case of spikes) and highly unpredictable behavior.
To evaluate more rigorously the statistical properties of the bootstrapped series with respect to their original counterparts, for each bootstrapped series we computed the following statistics:
1. average
2. standard deviation
3. skewness
4. kurtosis
5. minimum
6. maximum
7. autocorrelation at lag k, k = 1, ..., 8
8. slope, b, of a linear regression model

Statistics 1.-6. regard the distribution of prices, while 7. and 8. concern the dynamic structure of the series. The autocorrelations at lags k = 3 to k = 8 are observed to check whether the similarity between the original and the bootstrapped series is kept beyond the order k = 2 used to define the driving process.
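The statistics 1.-8. can be computed on one series as sketched below. The regressor of the slope in statistic 8. is truncated in the text; taking it to be time is our assumption, as are the standardized-moment definitions of skewness and kurtosis.

```python
import numpy as np

def series_statistics(x, max_lag=8):
    """Statistics 1.-8. for a single (bootstrapped or original) series.
    Skewness and kurtosis are the 3rd and 4th standardized moments;
    the slope is taken w.r.t. time (an assumption)."""
    x = np.asarray(x, dtype=float)
    n, mu, sd = len(x), x.mean(), x.std()
    z = (x - mu) / sd                      # standardized series
    stats = {
        "average": mu,
        "std": sd,
        "skewness": np.mean(z ** 3),
        "kurtosis": np.mean(z ** 4),
        "minimum": x.min(),
        "maximum": x.max(),
        "slope": np.polyfit(np.arange(n), x, 1)[0],
    }
    for k in range(1, max_lag + 1):        # autocorrelation at lags 1..8
        stats[f"acf_{k}"] = np.mean(z[:-k] * z[k:])
    return stats
```

Applying this function to each of the 5000 bootstrapped series yields, for every statistic, the empirical distribution from which percentiles and percentile ranks are derived.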
For the distribution of each of the above statistics 1.-8., Tables 2 and 3 report (for Spain and Germany, respectively) the 5th and the 95th percentiles, together with the corresponding value for the original series. To evaluate these distributions, we also report the percentile rank, i.e., the percentage of cases in which the bootstrapped statistic is less than or equal to the value observed on the original series.
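The two quantities reported in Tables 2 and 3 can be sketched as below. The simple order-statistic indexing used for the percentiles is one of several conventions and is our choice for illustration.

```python
def percentile_rank(bootstrap_values, observed):
    """Percentage of bootstrapped statistics <= the value observed on the
    original series (the percentile rank of Tables 2 and 3)."""
    inside = sum(1 for v in bootstrap_values if v <= observed)
    return 100.0 * inside / len(bootstrap_values)

def acceptance_range(bootstrap_values, lo=5, hi=95):
    """5th-95th percentile acceptance range for one statistic, via a
    plain order-statistic rule."""
    s = sorted(bootstrap_values)
    n = len(s)
    return s[int(n * lo / 100)], s[int(n * hi / 100) - 1]
```

Under the null hypothesis, the observed statistic should fall inside `acceptance_range` and its `percentile_rank` should not be systematically extreme.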
We can make the following observations:
i. For the four partitions π_S, π^milp_S, π_G, and π^milp_G, all the statistics computed on the original series take values between the two above percentiles. This is true also for the autocorrelations at lags k = 3 to k = 8, with the exception of the autocorrelation at lag k = 7 for the Spanish case.
ii. The percentile ranks fluctuate more in the Spanish scenarios than in the German ones: the lowest percentage for Spain is 46 % (see row "average" of partition π_S in Table 2), while it is 51 % for Germany (see row "average" of partition π^milp_G in Table 3). The highest percentages in the Spanish and German cases are 99 and 94 %, respectively. In general, though, the differences can be considered negligible.
iii. There seems to be no remarkable difference between the 5th and the 95th percentiles generated with the two partitions π_S and π^milp_S, suggesting that the bootstrapped series obtained with the two partitions are rather similar. The same observation applies to the German case.
iv. Not all the 5000 series generated in each setting reported a spike. This reflects the desirable property that a rare event, like a spike, does not appear regularly.
The results above suggest that we cannot reject the hypothesis that the original series was generated by the same Markov process that produced the 5000 bootstrapped series, both in the Spanish and in the German case. In addition, the statistical properties of the series generated by the bootstrapping method based on π^milp_S and π^milp_G do not differ significantly from those of the series generated by the procedure based on π_S and π_G, respectively.
To conclude, our empirical analysis provides evidence that the use of the aggregated transition probability matrix M* does not significantly alter the "information" contained in the original matrix M. Thus, M* can be adopted in the bootstrapping method to generate resampled series with the same features as the original one.

Conclusions
In this paper we study economic and financial phenomena evolving according to a Markov chain, and we apply bootstrapping to improve the parameter estimation obtained from a sample of observations of the series. This nonparametric, model-free approach is particularly suited for time series with nonlinear dependence in the data, like electricity prices, and its performance is improved here by adopting a Markov chain reduction that maintains the typical features of the observed time series. The key aspect of our work is that a Mixed Integer Linear Program, specifically tailored for our application, is adopted to formulate and solve the reduction problem. Even if solving the proposed model is computationally hard from a theoretical point of view, in our application to medium size real-life cases the problem was solved exactly within reasonable computational times. We note, however, that time is not a crucial aspect for bootstrapping, since it is a numerical method typically adopted in strategic decisions, estimation, scenario generation, and other analyses that do not involve time-critical decision making. With this approach, the purpose of simplifying the original transition probability matrix while preserving the law driving the process can be considered satisfactorily fulfilled, even for the case of electricity prices, which are particularly hard to replicate.
Since for large scale partitioning problems (i.e., when the transition probability matrix has a very large number of rows) a heuristic approach is still advised, the development of ad hoc heuristics for this kind of problem is one of our future lines of research on this subject.
Another interesting development could be the introduction, in the proposed MILP, of new constraints modeling specific relations that characterize the states of the Markov chain under study. This could help reduce the number of feasible partitions and, therefore, may be a viable strategy to reduce the computational effort of solving our reduction problem in the general case.

Fig. 6
Fig. 6 Index E_V(γ) for the German25 data set

Fig. 7
Fig. 7 Bootstrapped Spanish electricity prices. The square mark indicates that the selected value is the ending point of an observed probabilistic 2-state; the diamond mark indicates that the bootstrapped values are the last points of observed deterministic 2-states

Fig. 8
Fig. 8 Bootstrapped German electricity prices. The square mark indicates that the selected value is the ending point of an observed probabilistic 2-state; the diamond mark indicates that the bootstrapped values are the last points of observed deterministic 2-states

Table 1
Summary of the results of the MILP for the two data sets

Table 2
Percentiles of the distributions of some indexes computed over the 5000 bootstrapped series of Spain for two scenarios: the conservative scenario (partition π_S) and the MILP scenario (partition π^milp_S)

Table 3
Percentiles of the distributions of some indexes computed over the 5000 bootstrapped series of Germany for two scenarios: the conservative scenario (partition π_G) and the MILP scenario (partition π^milp_G)

Table 4
Coefficients estimates of an exponential regression model of trend and weekly seasonality applied to the series of electricity prices of Spain and Germany