Non-parametric Regression Model for Continuous-time Day Ahead Load Forecasting with Bernstein Polynomial

Growing perception of diverse generation resources and demand response operation of power system with high uncertainty has increased the attention to a more dynamic and accurate day-ahead load prediction. In this paper, we develop an stochastic model for short term load forecasting based on the Gaussian process, in which the non parametric estimator of the regression functions are obtained by using Bernstein polynomials. One of the major features of this model is its ability to predict a continuous load at any time of the day with a regression function. We use the historical data for training and the constrained marginal likelihood problem is optimized for finding the hyperparameters of the model. Real data sets from California ISO were used for training and testing the model. The results are compared to the day ahead piecewise constant load and the real time load. The common error measures are employed to infer the deviation of the load forecast from the real data.


I. INTRODUCTION
In power system operation, the electricity demand must be well matched with the supply. This fact requires an accurate dynamic load forecasting that the generation must fulfill. Beside automatic generation control, other factors such as unit commitment and economic dispatch are also dependent on the load forecast. However, uncertainty in the load demand and power generation availability leads to uncertainty in the day-ahead scheduling. This characteristic encourages the use of stochastic process to define the probability of supply and demand for unit commitment which needs to be done several hours in advance. The current unit commitment provides hourly schedules for the generations and assumes the load demand to be constant at each hour. However, in reality neither generation nor demand has the stepwise profile at each hour. Several load forecasting approaches for short term (hour to week ahead) time scales are proposed. They can be categorized as artificial intelligence methods and statistical techniques. Artificial Neural Network (ANN) [1] and neurofuzzy network [2] are between those artificial intelligence methods that has been used for load forecasting. There are several works that have been done on hourly load forecasting in day-ahead scheduling with stochastic models [3]. Among statistical techniques, regression models and time series are more popular. Time series methods assume internal structures for data, such as autocorrelations and trends, which they try to detect and explore. ARMA (Autoregressive Moving Average) for stationary processes and ARIMA (Autoregressive Integrated Moving Average) for nonstationary processes are most common used time series method [4], [5]. Regression models [6] uses several features for the linearization of the function to interpret the relation of different factors in an easy way. Multivariate polynomial and exponential regression [7] has been proposed for both short term and medium term load forecasting. Support vector machine, based on statistical learning theory is another method that has been used for load forecasting [8]. none of the methods that has been applied for the short term load forecasting offers a continuous time load profile estimation. There are several factors which can affect the load forecasting including the time, weather and customer classes [9]. However, in this work, we mostly focus on continuous time modeling of hourly loads without specifically involving other parameters. This paper describes a stochastic model based on the Gaussian process with Bernstein polynomial for the continuous time load estimation. We develop a stochastic model which estimate a continuous time function for each hour in the load profile.

II. GAUSSIAN PROCESS
In a Gaussian process, every input as x is associated with a normally distributed random variable. It is often shown as a vector x as it can include several variables. The output or target y may either be continues or discrete. For the first case, the process is known as a regression and for the latter one as a classification. Sets of samples with n observation D = (xi, yi)|i = 1, , n is using as a training data to define a function f that predicts the output for any possible input variable. Therefore, the characteristics of this function needs to be defined. Estimation of this function is based on the prior information and a finite set of observations. Wide range of functions can be assumed which Gaussian process can help in choosing between infinite set of possible functions. A Gaussian stochastic process governs the properties of the function. Prior is considered as our belief about the function in absence of knowledge about the data. Gaussian process specifies that the prior variance doesnt depend on the x variable and also it can be assumed that the mean of f (x) for any x is zero independent of its value. Then the combination of the prior and data leads to the posterior distribution of the function. Gaussian process is not a parametric model and doesnt intend to fit the data but its uncertainty reduces close to the observations. The specification of prior is important for inference of the function property. Smoothness and stationary properties can be specified by the covariance of the Gaussian process. Therefore finding a suitable covariance function is a challenge. In time series, the key aspect is how observations are related to each other in time. This concept is formalized through the covariance between elements which measures the degree of second order variation between two elements at two different times. If the statistical properties doesnt change over time, e.g. expectations, variances and other properties remain the same, the process is considered as stationary. If the mean or the variance of the time series changes over time, the process is non-stationary. The aim is to inference about the relationship between the inputs and the targets. Standard linear regression model with Gaussian noise can be defined as: x is the input vector, f is the function value and y is the observed target data. The observed value differs from the function values by additive Gaussian noise with zero mean and variance σ 2 n , ε ∼ N(0, σ 2 n ). The vector of the noise or error is distributed normally and is independent of the random function f . A Gaussian process is completely specified by its mean m(x) and covariance k(x, x ′ ) function.
where δ ii ′ is the Kronecker's delta and δ ii ′ = 1 if i = i ′ , otherwise it is zero. If there wouldn't be any noise in the observations, the terms related to the Gaussian noise can be neglected. Mean function m(x) and covariance function of f (x), k(x, x ′ ) are defined as: E stands for the expected value. To define f (x) in the form of Gaussian distribution, it will be: where µ and Σ are vectors of the mean and the covariance. For multivariate normal distribution with positive definite covariance matrix, the Gaussian probability density function is expressed by: The regression function f (x) is supposed to satisfy certain shape restriction with nonparametric form for all x values and with realization of the observation points. In our problem, Gaussian process is defined over the time, where the index of set of variables is time. For a fixed time, our aim is to estimate:f wheref (x t ) is the solution of the prediction problem which can be obtained by the maximum likelihood estimation. Maximum-likelihood estimation (MLE) is a method for estimating the parameters of a statistical model with available observation data. It is an interesting approach which can be used also for continuous time processes [9]. As our data assumed to have Gaussian distribution, logarithmic marginal likelihood function is defined as: We use the term marginal to emphasize that we are dealing with a non-parametric model. Assume Bayesian linear regression model for f (x): where β is a vector of weights and considered to be a prior parameter. ϕ(x) is a function that maps the D dimensional input vector x to N dimensional feature space using a sets of basis function (e.g. Bernstein polynomials). As long as the projections are fixed functions, the model is linear in the parameters [10]. We need to define the prior parameters before look at the observation. Prior is not dependent on the training data but it has some properties of the function. We take prior on β to be Gaussian: without loss of generality we assume for now that the mean is zero, b = 0. Then the covariance function of the f (x) will be: As the Gaussian process uses priors, the smoothness of the prior is defined by the covariance function. Choosing a proper covariance matrix for our model is important. For example, if we expect that for close-by input variables, the output will be close (continuity assumption), the covariance function must satisfy this characteristic. The covariance between the outputs is written as a function of the inputs. There are a number of common covariance functions available. The squared exponential covariance is used in our model which corresponds to a Bayessian linear regression model with an infinite number of basic functions: This function is infinitely differentiable. If the covariance wont be zero, we say that the errors of x p and x q are correlated. The covariance can be normalized to form a correlation coefficient: Usually the covariance functions have some free parameters called hyperparameters. In case of vague prior information, we use a hierarchical prior, where the mean and covariance functions are parameterized in terms of hyperparameters. The squared exponential covariance function with hyperparameters in one dimension has the following form: l is called the length scale of the process which practically shows how close two points have to be to notably influence each other. σ 2 f is the signal variance and δ pq is a Kronecker delta which is one when p = q and zero otherwise. We can now find the values of the hyperparameters which optimizes the marginal likelihood based on its partial derivatives which are easily evaluated.
θ m includes the hyperparameters of mean function µ = m(x) (which can be the coefficient of polynomial function) and θ k the hyperparameters of the covariance function, Σ = k(x, x ′ ).
III. PREDICTIVE GAUSSIAN PROCESS By combining the likelihood and the prior, the posterior gather everything we know about the parameters.The major goal of the posterior is to be used for the prediction of the future cases.
IV. BERNSTEIN POLYNOMIAL ESTIMATOR Polynomials are among popular mathematical tools for approximating and forming spline curves of any function with any desired accuracy. They have a variety of basis functions which can fit different applications. They can be easily differentiated and integrated and they have other useful properties. Bernstein basis is a commonly used base for the space of polynomial which can suitably be used for load profile modeling. The Bernstein polynomial estimator can be used for nonparametric curve estimations. It has many useful properties in preserving the different shapes of the regression functions. Using Bernstein as an estimator is computationally efficient as its coefficients can be computed easily as a solution of quadratic programming problem [11].
The Bernstein basis polynomial functions of order N are defined as: And the Bernstein polynomial of degree N can be expressed by: Where β k is a coefficient of Bernstein polynomial. Load profile at each hour (segment) can be addressed in the interval of [01] which can have several data points (e.g. the load value every 5 minutes) and each segment can be modeled by one Bernstein polynomial. However, if each segment will be defined in the closed interval of [a, b], then the corresponding polynomial is written as: Y 1 ), ..., (X n , Y n )} , to obtain the spline approximation of the load curve from the observations, the Bernstein coefficients must be calculated. Least Squares method is a standard approach to approximate the solution by minimizing the sum of the squares of the errors made in the results.

V. CONTINUITY AT THE JOINTS
As the goal is the piecewise approximation of the load profile, series interconnection of hourly curves linked together at joints must be considered. This requires the continuity conditions at the joints or control points. Different level of continuity can be defined. Parametric continuity of C0 refers to the continuity of two consecutive segments at the joint point: Where s is the number of segment and j is the index of a joint point of the segments. To consider the continuity condition in the Bernstein modeling problem, the constrained least square problem must be solved. The C 0 continuity condition is added as an equality constrained to the optimization problem and least square formula will be minimized subjected to that. However, we might require other orders of continuity to satisfy the smoothness and accuracy of the approximation. The curve segments should have the same slopes when they join together. Therefore, C 1 continuity refers to the slope matches where the curves join.
Derivative of Bernstein polynomials of degree N are polynomials of degree N − 1 and this derivative can be written as a linear combination of Bernstein polynomials: The C 1 continuity condition adds more equality constraints to the optimization problem.

VI. CONSTRAINED LEAST SQUARE PROBLEM
Our optimization problem is maximizing the marginal likelihood with respect to the hyper parameters. In addition to that the optimization must also satisfy continuity criteria. The load profile estimation must have the C 0 continuity at joints, and that the slope of the curve at joints, must have C 1 continuity. This is a linearly-constrained quadratic minimization, an ideal problem for Lagrange multipliers. If the covariance considered to be constant, the marginal likelihood must be optimized with respect to the Bernstein coefficients and satisfies the continuity conditions. By substituting the mean function m(x) = µ with Bernstein polynomial function b k (t) T β N , the augmented objective function (the Lagrangian) is then: C and d are the matrices with continuity parameters. Minimizing L with respect to β N and λ results in a system of linear equations for the optimum coefficients β N * and Lagrange multipliers λ * .
If C has independent rows and β N C has independent column, the KKT matrix is invertible.

VII. MODEL ACCURACY EVALUATION
To evaluate our model, two error types have been considered and employed. The Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE) are common types of error estimations. While the MAE gives the same weight to all errors, the RMSE gives errors with larger absolute values more weight than errors with smaller absolute values. Mean Average Percentage Error (MAPE) also has been commonly used to measure the forecasting performance. Root mean squared residual error at each point provides the squared difference between the observation values y and the predicted function m(x). For an unbiased estimator, the RMSD is the square root of the variance, known as the standard error.
Mean absolute error is used to know how close predictions are to the actual observations.
The root mean squared prediction error is computed on outof-sample data: m(x * i ) is the estimated function at test values and y * i is the true values for the test data points.

VIII. RESULT AND DISCUSSION
To analyze our proposed model, 26 sets of five-minute net-load data from California Independent System Operator CAISO, is used for training the process. CAISO is the largest Independent System Operator ISOs in the world managing about 80 percent of California's electric flow. These data are selected for a specific day of a week from six consecutive months (May-Oct), Fig. 1. The load has been normalized with respect to its maximum. The results are compared to the piece wise constant loads to show the effectiveness of continuous load forecasting for each hour. As the load for each hour interval can be modeled with a continuous Bernstein function, it will provide a more realistic load profile which can be used for the real-time economic dispatch. Bernstein polynomial of any degree can be used for the modeling but there is an optimal degree which can be a better fit to the load profile. In order to choose a proper order of Bernstein polynomials, the prediction error is estimated for different orders from N=1,...,10. As it can be seen in the table I, Bernstein order from 2 to 5 can fit the load profile with reasonable accuracy but with orders above 5, it cannot follow the shape well anymore and will show large oscillations for certain parts.
As we haven't considered the weather variables such as temperature in the model, the collected historical data needs to be in the same weather condition to cover the close load profiles with the least diversity. Forecasting a day-ahead load profile also requires the use of the trained model with the same seasonal condition. Load profile of each hour of the day get modeled by continuous Bernstein polynomials. Scheduling of the generation with their real ramping will benefit from the continuous load forecasting and leads to economical benefits.

IX. CONCLUSION
In this paper, a stochastic model based on the Gaussian process was presented for a short-term load forecasting. In this model, the non-parametric estimator of the regression functions is obtained by using Bernstein polynomials. Real data sets from California ISO were used as the historical data for training and the constrained marginal likelihood problem was optimized for finding the hyperparameters of the model. The common error measures were employed to infer the deviation of the load forecast from the real data. Bernstein polynomial of any degree could be used for the modeling but there are optimal degrees which can be a better fit to the load profile. As the load for each hour interval can be modeled with a continuous Bernstein function, it will provide a more realistic load profile which can be used for the real-time economic dispatch.