CHAPTER 1

INTRODUCTION

1.1 Introduction and Background of the Study

Regression analysis is a statistical process that explores the functional relationship between two or more variables so that a dependent variable (output) can be predicted from one or more independent variables (inputs) (Kutner et al., 2005). Regression analysis estimates the conditional expectation of the response variable given the explanatory variables. In other words, it estimates the average value of the dependent variable when the independent variables are fixed. This estimation can be carried out with a technique suited to the phenomenon or data set under study, such as the ordinary least squares (OLS) method. OLS is the most popular estimation method in linear regression because of its superior properties and ease of computation, provided that the Gauss-Markov assumptions are met. Under these assumptions the OLS estimator is the best linear unbiased estimator (BLUE), which is the case in particular when the random errors are independent and identically distributed (iid) normal. Unfortunately, the assumptions of a linear relationship between the variables and a normally distributed error term are violated in most real-life applications. Furthermore, the OLS estimator is not robust against unusual data points, which often appear in real-life applications. In other words, the OLS estimator has a very low breakdown point, equal to 1/n (Maronna et al., 2006), where n is the sample size. That is, even a single abnormal point can change the least squares estimate dramatically in the wrong direction (Rousseeuw and Leroy, 1987; Kamruzzaman and Imon, 2002; Maronna et al., 2006).
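The sensitivity of least squares to a single abnormal point can be illustrated with a small numerical sketch. The data below are hypothetical and the helper `ols_line` is an illustrative name, not part of the thesis; it fits a simple regression line by the usual closed-form expressions.

```python
def ols_line(x, y):
    """Return (intercept, slope) of the least squares line for paired lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical clean data lying exactly on y = 1 + 2x.
x = list(range(1, 11))
y_clean = [1 + 2 * xi for xi in x]

# Same data with a single gross vertical outlier in the last response.
y_contaminated = y_clean[:-1] + [-100]

clean_intercept, clean_slope = ols_line(x, y_clean)
bad_intercept, bad_slope = ols_line(x, y_contaminated)

print(f"clean fit:        intercept={clean_intercept:.2f}, slope={clean_slope:.2f}")
print(f"contaminated fit: intercept={bad_intercept:.2f}, slope={bad_slope:.2f}")
```

In this sketch one contaminated observation out of ten is enough to reverse the sign of the estimated slope, which is what a breakdown point of 1/n means in practice.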

The assumption of a normally distributed error term is violated in the presence of one or more outlying observations. Belsley et al. (1980) reported that outliers are those points that, either alone or together with several other points, have the largest influence on the computed values of the various estimates. Hawkins (1980) defined an outlier as an observation that deviates so much from the other observations as to arouse suspicion that it was generated by a different mechanism. Muñoz-Garcia et al. (1990) defined the outlier observation as “An outlier is an observation which being atypical and/or erroneous deviates decidedly from the general behavior of experimental data with respect to the criteria which is to be analyzed on it”. Barnett and Lewis (1994) defined outliers as those points that lie markedly far from the majority of points in a data set. In general, there are several classes of outliers in regression problems. Observations that are outlying in the Y-direction are referred to as outliers or vertical outliers. In contrast, observations that are outlying in the X-direction are called high leverage points (HLP). There is therefore an urgent need in regression analysis to determine whether HLP have a strong impact on the fit of a model (Belsley et al., 1980; Rousseeuw and Leroy, 1987).

In addition to outliers and non-linear relationships among the variables, the predicted model is seriously affected by high-dimensional and sparse data (the number of predictors p is larger than the number of observations n). The curse of dimensionality refers to the way certain algorithms, such as those used in numerical analysis, sampling, combinatorics, machine learning and data mining, may perform poorly on high-dimensional data. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In high-dimensional data, a matrix used by some algorithms may become singular, and additional information such as regularization or a Bayesian prior needs to be added to obtain a standard solution.
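A short calculation makes the sparsity argument concrete. The sketch below assumes, purely for illustration, that the data are spread uniformly over the unit hypercube; the function names are hypothetical.

```python
def central_fraction(side, dim):
    """Fraction of the unit hypercube [0, 1]^dim covered by a sub-cube of the given side."""
    return side ** dim


def side_for_fraction(fraction, dim):
    """Side length a sub-cube needs to capture the given fraction of uniformly spread data."""
    return fraction ** (1.0 / dim)


for dim in (1, 2, 10, 100):
    covered = central_fraction(0.5, dim)
    needed = side_for_fraction(0.1, dim)
    print(f"dim={dim:3d}  half-side cube covers {covered:.3e}  "
          f"side needed for 10% of the data: {needed:.3f}")
```

As the dimension grows, a cube with half the side length covers a vanishing share of the space, while a neighbourhood holding 10% of the data must span almost the whole range of every variable, so "local" neighbourhoods are no longer local.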

Several procedures that deal with these problems separately have recently become available. However, no extensive studies reported in the literature consider non-linearity, outliers and high dimensionality (full or less than full rank) simultaneously. As a result, the search for alternatives with the flexibility to handle these issues together, such as nonparametric methods and especially learning machines, has become an urgent necessity.

1.2 Importance and Motivation of the Study

Nonparametric regression is a form of regression analysis in which the predictor has no predetermined form but is constructed from information derived directly from the data. In contrast, classical regression techniques rest on strict assumptions: the underlying probability distribution of the data is known and the relationship among the variables is linear. However, in real applications we often face distribution-free regression problems with a non-linear relationship between the input and output variables (Ukil, 2007). One nonparametric method that requires no knowledge of the underlying probability distribution of the data and can handle non-linear relationships is the support vector machine. The support vector machine (SVM) is a comparatively new and promising technique for learning separating functions in classification problems (SVC) or for performing function estimation in regression problems (SVR).

The support vector machine was initially applied to classification tasks (Cortes and Vapnik 1995), but the formulation was soon extended to regression problems (Smola, and Vapnik 1997; Vapnik 1995). The advantages of the support vector machine are its ability to model non-linear relationships by employing the kernel trick and its excellent generalization ability in real classification and regression applications, while still being capable of producing a sparse model (not all observations are needed to find the optimal model) (Ceperic et al. 2014). The common formulation of the support vector machine for regression is Vapnik’s ε-tube SV regression (ε-SVR) (Smola, and Vapnik 1997). The ε-SVR produces a predictive model that depends only on a subset of the training points, since it ignores any points within the threshold ε. This reveals a potential problem: if the value of the threshold ε is small, the resulting model depends on a larger share of the training points, which makes the solution non-sparse, as demonstrated in Guo et al. (2010).

Both parametric and nonparametric regression techniques are affected by the presence of single or multiple extreme points in the data (parametric methods are certainly more affected than nonparametric ones). Many researchers have reported that real data sets commonly contain between 1% and 10% unusual points (Hampel et al. 1986; Wilcox, 2005). Outliers and HLP have a great effect on the values of various estimates, which leads to misleading conclusions and, in turn, wrong decisions. Hence, it is necessary either to detect those unusual observations and remove them before building the predictive model (Cook, 1977) or to turn to robust methods (Huber, 1973), which minimize the impact of outliers instead of removing them from the data. The choice between these approaches is up to the researcher.

Several parametric methods are used for detecting single or multiple outliers and HLP. Unfortunately, they fail to identify multiple abnormal points in a data set because of the masking and swamping effects (Rousseeuw and Leroy, 1987). Moreover, these methods cannot deal with less than full rank data. To address this problem, some researchers have explored non-parametric methods for outlier detection in both the full rank and the less than full rank cases. Jordaan and Smits (2004) suggested using standard support vector regression (SSVR) for outlier detection. The idea of this technique is to run the SV regression model many times and flag the points that are suspected to be outliers. Nishiguchi et al. (2010) pointed out that some problems arise when applying it to real data. It incurs high computational costs when the data contain multiple outliers, because detecting an outlier requires a number of iterations of the calculation, and trial and error is needed for accurate detection, since it is not clear how to set the outlier threshold value. To remedy this problem, Nishiguchi et al. (2010) developed the modified support vector regression (MSVR) technique for outlier detection by employing a new trade-off parameter, which is successful in identifying outliers and HLP. Nonetheless, the MSVR approach is suitable only when the data contain few outliers, since one iteration is required to detect each outlier. Consequently, in the presence of multiple outliers its computational cost approaches that of the standard SVM regression method. Further, there is no clear rule for choosing the value of the threshold parameter, although the method is supplied with a fixed value for it. The shortcomings of these methods have inspired us to develop new techniques that improve the performance of standard SVM regression for outlier detection, which we call the fixed parameters support vector regression (FP-SVR). The two proposed methods are expected to achieve accurate detection of outliers and HLP (bad leverage points only) with fixed parameters in a single iteration.

This thesis is also concerned with the use of robust methods to address the presence of outliers and bad leverage points (BLP) in multiple linear regression models. As mentioned previously, the OLS estimator is seriously affected by the presence of outliers. One of the most common alternatives to OLS for handling outliers is the robust regression procedure (Hampel, 1974). There are many robust regression methods in the literature, such as the least absolute values (LAV), the M-estimator, the generalized M-estimator (GM1-estimator), the least median of squares (LMS), the S-estimator, the least trimmed squares (LTS), the MM-estimator and the new class of GM-estimator (GM6) proposed by Coakley and Hettmansperger (1993). Yohai and Zamar (1988) recommended that a robust regression technique should achieve simultaneously: (a) a high breakdown point of nearly 50%, (b) a bounded influence function and (c) a high efficiency. According to this recommendation, only the GM6 method achieves the three conditions (a), (b) and (c) simultaneously. Regrettably, this method treats good leverage points as bad leverage points, which means that its efficiency tends to decrease in the presence of “good” leverage points. This limitation has inspired us to develop a new class of GM-estimators based on the fixed parameters support vector regression technique established in Chapter 3, which minimizes the impact of the bad leverage points only; we call it GM-SVR.

This thesis also addresses the problem of high dimensionality in linear and nonlinear regression models. It should be noted that the sparsity (lower complexity) that characterizes the SVR model is not by itself sufficient to ensure good generalization, and that the solution becomes non-sparse when the threshold ε is small, near zero (Ceperic et al. 2014). It is well known that support vector regression is a fully nonparametric approach, which makes it flexible, but at the same time its precision decreases as the number of covariates increases, which is known as the curse of high dimensionality (Härdle et al. 2004). For this reason, an alternative is needed to cope with this drawback. One of the common techniques for improving generalization accuracy and overcoming the curse of high dimensionality is the single index model. Ichimura (1993) suggested a semiparametric model, called the single index model, which combines the flexibility of the nonparametric model with the high accuracy of the parametric model. This model summarizes the covariates in a single variable called the index. To the best of our knowledge, there is no existing research in the literature that uses SVR to estimate the unknown link function of the single index model. This inspires us to propose a new technique that uses the SVR model to estimate the unknown link function of the single index model, namely the single index support vector regression (SI-SVR).

It should be stated that the SI-SVR model cannot model rank deficient data. Furthermore, the efficiency of the resulting model may decline, and less accurate predictions will be produced, when unnecessary predictors are included in the model (Tibshirani, 1996; Hastie et al., 2009). A new method is therefore required to overcome this issue. This can be done by employing variable selection so that the single index model can still be fitted; we call the resulting method the elastic net single index support vector regression (ENSI-SVR).

1.3 Research Objectives

The main goal of this thesis is to investigate high dimensionality problems for linear and nonlinear regression models in the presence of outliers (outlying in the X and Y coordinates). Classical estimation methods such as the ordinary least squares (OLS) method are not robust against outliers. Moreover, they cannot capture nonlinear relationships, and their assumptions are difficult to meet for high-dimensional data. The foremost objectives of our research can be outlined as follows:

To propose new improved diagnostic methods for the identification of multiple outliers based on two types of kernel functions.

To formulate a new robust estimation method to remedy the presence of outliers in the data for the linear regression model.

To propose a new semi-parametric method that copes with the curse of high dimensionality by combining the high precision of parametric methods with the flexibility of nonparametric methods.

To develop the elastic net penalty approach for selecting variables in a single index support vector regression model to overcome the curse of high dimensionality when the number of predictors p is larger than the sample size n.

1.4 Scope and Limitation of the Study

Linear and nonlinear regression models are widely used in many areas of study such as bioinformatics, economics, financial prediction and the social sciences, and they have many practical uses. Most applications of linear regression models are estimated with the OLS method because of its ease of computation and its optimal properties when the underlying assumptions are met. In reality, the OLS estimator is not resistant to outlying observations; even one outlier can destroy the OLS estimator. The alternative procedures used to address this issue are detection methods and robust statistical methods. Flexible techniques such as SSVR and MSVR have been suggested for the identification of outliers and HLP in full rank and less than full rank data. Nonetheless, these existing methods focus only on the identification of leverage points without classifying them into good and bad leverage points. It is very important to detect and classify the good and bad leverage points, as only bad leverage points are responsible for misleading conclusions about the fit of the regression model. On the other hand, many robust estimation techniques have been suggested, such as the LMS-estimator, LTS-estimator, M-estimator, GM1-estimator, MM-estimator and GM6-estimator. However, some of these methods are not robust against leverage points and some treat good leverage points as bad leverage points.

The other approach to statistical modeling is the nonparametric procedure, which is used to handle nonlinear relationships and high dimensional problems, including the case where the number of predictors p is much greater than the sample size n. One of the most effective methods in the nonparametric machine learning community is the support vector machine (Frohlich and Zell, 2005). However, the ability of the SVM model to handle high dimensional problems is reduced because the resulting model is non-sparse when the threshold is small. Furthermore, the generalization performance of SVM depends heavily on the right selection of the hyper-parameters C and ε, so the major issue for practitioners attempting to apply SVM is how to set these parameter values to guarantee good generalization performance for a given training data set. It should be noted that all calculations have been implemented in R.

1.5 Overview of the Thesis

In accordance with the objectives and the scope of the study, the contents of this thesis are structured in eight chapters. The chapters are organized so that the study objectives are apparent and are addressed in the sequence outlined.

Chapter Two: This chapter briefly reviews the literature on the least squares estimation method and the violations of its underlying assumptions, such as departure from normality and the presence of outliers. The literature on the support vector machine for regression and its basic idea of employing the kernel trick during the estimation process is highlighted. Outliers, leverage points and their diagnostic methods are also discussed. Moreover, the basic concepts of robust linear regression and some important existing robust regression methods are reviewed. Bootstrapping methods are briefly discussed. The main idea of the single index model and its estimation methods are also discussed. Finally, the concept of variable selection and some penalization methods are briefly highlighted.

Chapter Three: This chapter discusses the existing SSVR and MSVR methods developed by Jordaan and Smits (2004) and Nishiguchi et al. (2010). The new proposed methods (FP-SVR) for the identification of multiple vertical outliers and bad leverage points are presented in this chapter. The steps of the proposed FP-SVR methods and their algorithm are also highlighted. Finally, real data examples and simulation studies are discussed to evaluate the performance of the proposed methods.

Chapter Four: This chapter deals with the development of the GM-estimator based on FP-SVR (denoted by GM-SVR) for data having outliers and bad leverage points. Two Monte Carlo simulation studies and two numerical examples are carried out to assess the performance of the proposed method.

Chapter Five: In this chapter, we present the proposed semi-parametric model for the high dimensional problem, namely the single-index support vector regression (denoted by SI-SVR). The new proposed technique is useful for overcoming the so-called curse of high dimensionality. In this respect, two types of data are considered, with linear and nonlinear relationships. Numerical and simulation examples are also discussed to assess the proposed method.

Chapter Six: In this chapter, the concept of variable selection is used to obtain a non-singular predictor matrix when the number of predictors p is larger than the sample size n. Then the proposed model, namely the elastic net single-index support vector regression (denoted by ENSI-SVR), can be used to remedy the curse of high dimensionality. The proposed semi-parametric model combines the high accuracy of parametric methods with the flexibility of nonparametric methods. Monte Carlo simulation studies and a numerical example are given to assess the performance of the proposed method.

Chapter Seven: This chapter provides the summary and detailed discussions of the thesis conclusions. Areas for future research are also recommended.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

Regression analysis is a statistical technique that is widely used for modeling and analyzing several variables, and its use overlaps substantially with the field of machine learning. The main goal of regression modeling is to estimate the functional relationship among two or more quantitative variables so that a response variable can be predicted from one or more explanatory variables. Regression analysis can also be used to understand which of the predictors are related to the response variable and to explore the forms of these relationships. In this study, we deal with two of the most commonly used classes of regression models, the linear and the nonlinear regression models, which provide the scientist with a powerful tool. This chapter reviews the literature on the most important issues in this thesis, namely robust diagnostic methods, modern diagnostic methods, robust estimation techniques and nonparametric learning machines for linear and non-linear regression models with outliers and high dimensionality problems.

2.2 Background and Notation

Regression analysis answers some vital questions about the functional relationship between a dependent variable ($y$) and one or more predictors ($x_1, \dots, x_p$), such as estimating the influence of a change in an independent variable on the response, predicting future values of the dependent variable and interpreting which predictor variables are important (Montgomery et al., 2015; Weisberg, 2005). A random error term is added to the regression model to account for individual differences, as in the following formula

$$y_i = f(x_{i1}, x_{i2}, \dots, x_{ip}) + \varepsilon_i, \qquad i = 1, 2, \dots, n, \tag{2.1}$$

where $\varepsilon_i$ is the random error of the $i$th observation, $p$ is the number of predictor variables and $n$ is the sample size of the data. A linear function between $y$ and $x_1, \dots, x_p$ is named the linear regression model, which can be defined as

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i. \tag{2.2}$$

The matrix form of the above regression model is given by

$$y = X\beta + \varepsilon, \tag{2.3}$$

where $y$ is an $n \times 1$ vector of the response variable, $X$ is an $n \times (p+1)$ matrix of regressor variables, $\beta$ is a $(p+1) \times 1$ vector of the unknown regression parameters to be estimated (including the intercept parameter) and $\varepsilon$ is an $n \times 1$ random vector assumed to be independently and identically distributed (iid) normal with zero mean and standard deviation $\sigma$. The model (2.3) can be expressed as

(2.4)

The alternative standardized form of the regression model is commonly used in regression analysis because it is difficult to compare directly the estimated coefficients of variables that are measured in different units (Montgomery et al., 2015; Kutner et al., 2005).

2.2.1 The Standardized Form

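The standardized form is commonly used in many multiple regression applications to reduce the effects of differing units of measurement. The standardized form is defined as follows for all $i = 1, \dots, n$ and $j = 1, \dots, p$ (Montgomery et al., 2015; Kutner et al., 2005):

$$y_i^{*} = \frac{y_i - \bar{y}}{s_y}, \qquad x_{ij}^{*} = \frac{x_{ij} - \bar{x}_j}{s_j},$$

where $\bar{x}_j$ and $s_j$ are the mean and the standard deviation of the predictor variable $x_j$; $\bar{y}$ and $s_y$ are the mean and the standard deviation of the dependent variable, which are given, respectively, by

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^{2}}.$$

Assume $y^{*}$ and $X^{*}$ are the standardized forms of the response and predictor variables, respectively. The linear regression model (2.3) can be rewritten as

$$y^{*} = X^{*}\beta^{*} + \varepsilon^{*}. \tag{2.9}$$

It should be noted that the above model (2.9) does not have an intercept parameter and all variables (dependent and independent) have zero mean and a standard deviation of one.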
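The standardization described above can be sketched in a few lines. The data and variable names below are hypothetical, and the sample standard deviation (divisor n - 1) is an assumption about the scaling used; the sketch only checks that standardized variables have zero mean and unit standard deviation, and that a simple regression on them needs no intercept.

```python
import statistics


def standardize(values):
    """Centre by the mean and scale by the sample standard deviation (divisor n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical predictor and response measured in different units.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weight_kg = [52.0, 58.0, 69.0, 77.0, 90.0]

z_height = standardize(height_cm)
z_weight = standardize(weight_kg)

# Least squares slope through the origin for the standardized variables.
slope = sum(a * b for a, b in zip(z_height, z_weight)) / sum(a * a for a in z_height)

print("mean of standardized predictor:", round(statistics.mean(z_height), 12))
print("sd of standardized predictor:  ", round(statistics.stdev(z_height), 12))
print("standardized slope:            ", round(slope, 4))
```

With a single predictor the standardized slope equals the sample correlation coefficient, so it is unit-free and always lies between -1 and 1.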

2.3 Ordinary Least Squares Estimation Method

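The main idea behind the OLS method is to minimize the sum of squared residuals, which is given as follows

$$S(\beta) = \sum_{i=1}^{n} \varepsilon_i^{2} = (y - X\beta)^{T}(y - X\beta). \tag{2.10}$$

In order to find the estimated parameters, the partial derivative of (2.10) with respect to the vector of parameters is taken and equated to zero:

$$\frac{\partial S(\beta)}{\partial \beta} = -2X^{T}y + 2X^{T}X\beta = 0.$$

The above partial derivative yields the normal equations that must be solved

$$X^{T}X\hat{\beta} = X^{T}y.$$

The solution based on the OLS method is

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y.$$

Then, the fitted model can be formulated as

$$\hat{y} = X\hat{\beta}.$$

The difference between the actual values ($y_i$) and the predicted values ($\hat{y}_i$) is known as the residuals ($e_i$), defined as

$$e_i = y_i - \hat{y}_i, \qquad i = 1, 2, \dots, n.$$

To evaluate the estimated parameters, one needs the estimator of the variance-covariance matrix. The variance-covariance matrix of the OLS estimator can be defined as

$$\widehat{\operatorname{Var}}(\hat{\beta}) = \hat{\sigma}^{2}(X^{T}X)^{-1},$$

where $\hat{\sigma}^{2}$ is the estimator of the scale parameter $\sigma^{2}$, which is given as follows

$$\hat{\sigma}^{2} = \mathrm{MSE} = \frac{\sum_{i=1}^{n} e_i^{2}}{n - p - 1},$$

where the MSE is the mean squared error, which is used to evaluate the final fitted model.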
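The estimation steps of this section can be traced numerically for the simplest case of one predictor plus an intercept, where the matrix X'X is 2 x 2 and can be inverted by hand. This is a minimal sketch on hypothetical data; the function name and the divisor n - p - 1 for the MSE are assumptions made for illustration.

```python
def ols_normal_equations(x, y):
    """Solve the normal equations for y = b0 + b1 * x and return the usual by-products."""
    n = len(x)
    p = 1  # one predictor besides the intercept
    # Entries of X'X and X'y when X = [1, x].
    sx, sxx = sum(x), sum(xi * xi for xi in x)
    sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx
    # Inverse of the 2 x 2 matrix X'X.
    xtx_inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    beta = [
        xtx_inv[0][0] * sy + xtx_inv[0][1] * sxy,
        xtx_inv[1][0] * sy + xtx_inv[1][1] * sxy,
    ]
    fitted = [beta[0] + beta[1] * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    mse = sum(e * e for e in residuals) / (n - p - 1)
    # Estimated variance-covariance matrix of the coefficients.
    cov_beta = [[mse * v for v in row] for row in xtx_inv]
    return beta, residuals, mse, cov_beta


# Hypothetical data close to the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta, residuals, mse, cov_beta = ols_normal_equations(x, y)
print("coefficients:", [round(b, 4) for b in beta])
print("MSE:", round(mse, 5))
print("standard errors:", [round(cov_beta[k][k] ** 0.5, 4) for k in range(2)])
```

For more than one predictor the same steps apply with a general linear solver in place of the explicit 2 x 2 inverse.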

2.3.1 The Classical Gauss-Markov Assumptions

In order to get the best inference about the true parameters of linear regression models, some underlying assumptions have to be satisfied. The Gauss-Markov assumptions are considered the cornerstone of most statistical theory. The main Gauss-Markov assumptions are the following (Montgomery et al., 2015; Groß, 2003; Kutner et al., 2005).

This assumption states that the relationships among variables are linear.

The matrix $X$ must be of full rank.

This assumption states that the sample size must be greater than the number of predictors.

According to this assumption, the residuals are independent and normally distributed with zero mean and constant variance; this covers the usual assumptions of homoscedasticity and no autocorrelation.

This assumption states that there is no relationship among predictors and residuals (the predictors are independent of the residuals).

$\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$, which means no autocorrelation among the residuals.

When the above assumptions are met, the OLS estimate is unbiased and has the smallest variance among all unbiased estimators. Nevertheless, if the aim of the study is parameter estimation only, without making inferences or drawing conclusions about the true population parameters, the estimator can be the best linear unbiased estimator (BLUE) under fewer assumptions (Hampel, 1974).

2.3.2 Limitation of the Least Squares Assumptions

It is well known that the OLS estimator is easy to compute but at the same time highly sensitive to violations of one or more of the model assumptions. When these assumptions are met, the OLS estimator has attractive properties: minimum variance among unbiased estimators, high efficiency, consistency and asymptotic normality. Unfortunately, in most real-life applications the linearity assumption does not hold, and the normality assumption on the residuals is violated when there are outlying observations in the data. Besides that, the OLS technique is not suitable for high-dimensional data because of the statistical difficulties that accompany the estimation process. Moreover, in the presence of outliers in the data set, robust estimation procedures may be better than least squares even though they are not unbiased (Kutner et al., 2005; Maronna et al., 2006). In contrast, non-parametric machines could be appropriate techniques to remedy the problems of non-linearity and high dimensionality (Williams 2011). Furthermore, the OLS method cannot be performed when the number of predictors p is greater than the sample size n.
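The reason OLS breaks down when p exceeds n is that the matrix X'X becomes singular, so the normal equations have no unique solution. The following sketch checks this on a tiny hypothetical example with three predictors (no intercept, as in a standardized model); the helper names are illustrative only.

```python
def gram_matrix(X):
    """Return X'X for a matrix given as a list of rows."""
    cols = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(cols)] for a in range(cols)]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


# Two observations on three predictors: p = 3 > n = 2.
X_wide = [[2, 1, 3], [1, 5, 7]]

# Adding a third, linearly independent observation restores full column rank.
X_square = X_wide + [[1, 4, 1]]

print("det(X'X) with n = 2, p = 3:", det3(gram_matrix(X_wide)))
print("det(X'X) with n = 3, p = 3:", det3(gram_matrix(X_square)))
```

A zero determinant means (X'X) has no inverse, which is why regularization or variable selection is needed before a least squares type solution can be computed.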

2.4 Introduction to Support Vector Machine for Regression

In most real-life applications, the classical regression assumptions, such as linear relationships among the variables and a known underlying probability distribution of the data, are difficult to satisfy (Ukil, 2007). For this reason, alternative procedures such as nonparametric machine learning should be used.

The support vector machine (SVM) is a comparatively new and promising technique for learning separating functions in classification problems (SVC) or for performing function estimation in regression problems (SVR). Applications of SVM have been increasing lately because of its high performance and its ability to transform non-linear relationships among variables into a linear form by employing a kernel function. The support vector machine originated from statistical learning theory (SLT) for distribution-free learning from data and was introduced by Cortes and Vapnik (1995). They presented SVM as a set of related supervised learning methods for solving regression as well as classification problems. Since then, it has generated significant theoretical and empirical interest in the machine learning community because of its excellent performance in a diversity of learning problems, such as bioinformatics (Ben-Hur et al. 2008), image analysis (Guo et al. 2008), bankruptcy prediction (Härdle et al. 2011), financial prediction and marketing databases (Ukil, 2007), text categorisation (Kuo and Yajima 2010) and artificial intelligence (Frohlich and Zell, 2005). Some additional reasons behind the wide use of the SVM are its lower sensitivity to local minima and the theoretical guarantees about its performance.
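The kernel idea mentioned above can be illustrated with a degree-2 polynomial kernel, chosen here only as a convenient example and not as the kernel used in this thesis. The sketch checks that evaluating the kernel on the original inputs gives the same number as an ordinary dot product after an explicit non-linear feature map; all names are hypothetical.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(u, v):
    """Homogeneous polynomial kernel of degree 2: k(u, v) = <u, v>^2."""
    return dot(u, v) ** 2


def poly2_features(u):
    """Explicit feature map for two inputs whose dot product reproduces poly2_kernel."""
    x1, x2 = u
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


u = [1.0, 2.0]
v = [3.0, -1.0]

kernel_value = poly2_kernel(u, v)
explicit_value = dot(poly2_features(u), poly2_features(v))

print("kernel on the inputs:        ", kernel_value)
print("dot product in feature space:", round(explicit_value, 12))
```

A model that is linear in the mapped features is non-linear in the original inputs, and the kernel lets it be fitted without ever constructing the feature vectors.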

The support vector machine was initially applied to classification tasks (Cortes and Vapnik 1995), but in the same year the formulation was extended to include regression estimation (Vapnik, 1995; Vapnik et al. 1996). The SVM is characterized by its ability to produce a solution which is global, unique and sparse (Ceperic et al. 2014). The common formulation of support vector regression is Vapnik’s ε-insensitive SVR. It produces a predictive model that depends only on a part of the training points (a sparse model), since it ignores any points within the threshold ε. In general, a sparse regression model is a simplified model that can exhibit a high ratio of accuracy to complexity. In addition, the SVR model has a regularization term (on the weights) in its training formulation, which helps to minimize the complexity of the model. Several researchers have pointed to the benefits of the sparse regression model, which can be summarized in the following three points (De Brabanter et al. 2010; Figueiredo 2003; Roth 2004; Tipping 2001; Guo et al. 2010):

Tendency to avert over-fitting: if the model is less complex, it is less likely to over-fit the data.

Reduced computational cost at the stage of active use: the evaluation time of the SV machine model is directly proportional to the number of support vectors, so a decrease in this number increases the execution speed.

Ability to generalize: over-fitting and generalization are two closely related concepts, in that a decrease in the probability of over-fitting increases the generalization ability of the model.

Unfortunately, the ability of the SVR to produce a sparse model is not, by itself, enough to ensure that the model will generalize well. For instance, if the value of the parameter ε is too small, the resulting model will depend on most of the training points, and the resulting solution is non-sparse (Guo et al. 2010).
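The link between a small ε and a non-sparse solution can be seen by counting how many points fall on or outside the tube. The sketch below uses a fixed set of hypothetical residuals, which is a simplification: in a trained SVR the fitted function itself changes with ε, so the counts are only indicative.

```python
def points_outside_tube(residuals, eps):
    """Count residuals on or outside the tube of half-width eps (support vector candidates)."""
    return sum(1 for r in residuals if abs(r) >= eps)


# Hypothetical residuals y_i - f(x_i) from some fixed regression function f.
residuals = [0.02, -0.15, 0.40, -0.05, 0.90, -0.30, 0.10, -0.60]

for eps in (1.0, 0.5, 0.2, 0.01):
    count = points_outside_tube(residuals, eps)
    print(f"eps={eps:<5} candidates: {count} of {len(residuals)}")
```

As ε shrinks, nearly every training point lies outside the tube and would enter the model, which is the loss of sparsity described above.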

2.4.1 The Basic Idea

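We consider the training data $\{(x_1, y_1), \dots, (x_n, y_n)\} \subset \mathcal{X} \times \mathbb{R}$, where $\mathcal{X}$ is the space of the input variables (e.g., $\mathcal{X} = \mathbb{R}^{p}$). The goal of the ε-tube SVR is to find a function $f(x)$ which has at most ε deviation from the actually obtained outputs $y_i$ for all the training data, and at the same time is as flat as possible (Smola and Schölkopf 2004). In other words, we do not care about training errors as long as they are less than the threshold ε, but any deviation greater than this is not accepted. We start by describing the case of a linear function f, as follows:

$$f(x) = \langle w, x \rangle + b, \qquad w \in \mathcal{X},\; b \in \mathbb{R}, \tag{2.17}$$

where $\langle \cdot, \cdot \rangle$ is the dot product in $\mathcal{X}$, and $w$ and $b$ are the slope and offset of the regression function. Flatness in the function (2.17) means that one seeks a small $w$ by minimizing the Euclidean norm $\|w\|^{2}$.

This problem can be written as a convex optimization problem (Smola and Schölkopf 2004)

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\|w\|^{2} \\ \text{subject to} \quad & y_i - \langle w, x_i \rangle - b \le \varepsilon, \\ & \langle w, x_i \rangle + b - y_i \le \varepsilon. \end{aligned} \tag{2.18}$$

The implied assumption in (2.18) is that the convex optimization problem is feasible. In order to cope with the case of infeasible constraints in the optimization problem (2.18), slack variables are introduced. This procedure leads to the formulation stated by Vapnik (1995):

$$\begin{aligned} \text{minimize} \quad & \tfrac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}(\xi_i + \xi_i^{*}) \\ \text{subject to} \quad & y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i, \\ & \langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^{*}, \\ & \xi_i, \xi_i^{*} \ge 0, \end{aligned} \tag{2.19}$$

where $\xi_i$ and $\xi_i^{*}$ are the slack variables that give the upper and the lower errors, and $C > 0$ is the trade-off between the model complexity and the amount up to which deviations larger than ε are tolerated.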

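This corresponds to minimizing the so-called $\varepsilon$-insensitive loss function described in (2.20), which presents the best robustness traits among common loss functions such as Huber's, Gaussian, and Laplacian (Colliez et al. 2006):

$$|\xi|_{\varepsilon} = \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise} \end{cases} \tag{2.20}$$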
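As a small illustration of the $\varepsilon$-insensitive loss, the following sketch evaluates it on a few residuals. The function name, the residual values and the tube width 0.1 are illustrative assumptions, not values from this study.

```python
def epsilon_insensitive(residual, eps):
    """Return the epsilon-insensitive loss: zero inside the tube, linear outside it."""
    return max(0.0, abs(residual) - eps)

# Hypothetical residuals y_i - f(x_i); eps = 0.1 is an arbitrary tube half-width.
residuals = [-0.30, -0.05, 0.0, 0.08, 0.25]
losses = [epsilon_insensitive(r, eps=0.1) for r in residuals]
```

Only the first and last residuals fall outside the tube, so only they contribute a non-zero loss, and that loss grows linearly with the distance from the tube.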

Figure 2.1 depicts the situation graphically. It can be seen that only the points which lie outside the $\varepsilon$-tube (the shaded region) are considered support vectors, and their deviations are penalized in a linear fashion. According to Lee and Mangasarian (2001), the optimization problem (2.19) can be solved more easily in its dual formulation when the dimension of the parameter vector $w$ is much larger than the number of samples. Moreover, the dual formulation provides the key to extending the SVR algorithm to nonlinear functions.

Figure 2.1: The soft margin loss setting for a linear SVM (Schölkopf and Smola 2002)

2.4.2 Dual Problem and Quadratic Programs

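The main idea is to build the Lagrange function (L) from the objective function and its corresponding constraints; this function has a saddle point with respect to the primal and dual variables at the solution (for more details see Vanderbei 1999). This is done by introducing a dual set of variables, as follows:

$$\begin{aligned} L = {} & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) - \sum_{i=1}^{n} (\eta_i \xi_i + \eta_i^* \xi_i^*) \\ & - \sum_{i=1}^{n} \alpha_i \left( \varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b \right) \\ & - \sum_{i=1}^{n} \alpha_i^* \left( \varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b \right) \end{aligned} \tag{2.21}$$

where $\eta_i, \eta_i^*, \alpha_i, \alpha_i^*$ are the Lagrange multipliers, which have to satisfy positivity constraints, i.e.

$$\alpha_i,\ \alpha_i^*,\ \eta_i,\ \eta_i^* \ge 0 \tag{2.22}$$

The partial derivatives of L with respect to the primal variables ($b, w, \xi_i, \xi_i^*$) are taken and set to zero in order to achieve optimality (Smola and Schölkopf 2004):

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{n} (\alpha_i^* - \alpha_i) = 0 \tag{2.23}$$

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, x_i = 0 \tag{2.24}$$

$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \eta_i = 0 \tag{2.25}$$

$$\frac{\partial L}{\partial \xi_i^*} = C - \alpha_i^* - \eta_i^* = 0 \tag{2.26}$$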

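Substituting (2.23), (2.24), (2.25) and (2.26) into (2.21) yields the following dual optimization problem (Smola and Schölkopf 2004):

$$\begin{aligned} \text{maximize} \quad & -\tfrac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^*) \\ \text{subject to} \quad & \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0, \quad \alpha_i,\ \alpha_i^* \in [0, C] \end{aligned} \tag{2.27}$$

In deriving (2.27), the dual variables $\eta_i$ and $\eta_i^*$ are eliminated through conditions (2.25) and (2.26). Equation (2.24) can be rewritten to give the weight vector as

$$w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, x_i \tag{2.28}$$

Thus, the regression function is represented as

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b \tag{2.29}$$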

It should be noted that the parameter b can be computed by exploiting the Karush–Kuhn–Tucker (KKT) conditions (Ceperic et al. 2014; Keerthi et al. 2001). The KKT conditions for nonlinear programming generalize the method of Lagrange multipliers by allowing inequality constraints in addition to equality constraints (Boyd and Vandenberghe 2004).
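As a worked sketch of how b follows from the KKT conditions, write $\alpha_i, \alpha_i^*$ for the Lagrange multipliers of the two $\varepsilon$-tube constraints and $\xi_i, \xi_i^*$ for the corresponding slack variables (this notation is assumed here, following Smola and Schölkopf 2004). At the solution, the products of the dual variables and the constraints vanish, and any training point whose multiplier lies strictly between 0 and C determines b:

```latex
% Complementary slackness (KKT) conditions at the optimum
\alpha_i \left( \varepsilon + \xi_i - y_i + \langle w, x_i \rangle + b \right) = 0, \qquad
\alpha_i^* \left( \varepsilon + \xi_i^* + y_i - \langle w, x_i \rangle - b \right) = 0,

(C - \alpha_i)\, \xi_i = 0, \qquad (C - \alpha_i^*)\, \xi_i^* = 0.

% If 0 < \alpha_i < C, then \xi_i = 0 and the first bracket must vanish:
b = y_i - \langle w, x_i \rangle - \varepsilon \quad \text{for } \alpha_i \in (0, C),

% and, analogously, if 0 < \alpha_i^* < C:
b = y_i - \langle w, x_i \rangle + \varepsilon \quad \text{for } \alpha_i^* \in (0, C).
```

In practice, b is often averaged over all points that satisfy one of these two conditions in order to reduce numerical error.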

2.4.3 Generalized SVR Algorithm for Nonlinear Case

In the previous subsections, only the linear regression case has been discussed. The next step is to make the SV algorithm nonlinear. This could be achieved by simply preprocessing the training patterns $x_i$ with a map $\Phi$. The function $\Phi$ transforms the input space into some feature space in which the standard SVR algorithm can be applied (Ceperic et al. 2014; Smola and Schölkopf 2004). Unfortunately, this technique can easily become computationally expensive for high-order polynomial features and for high-dimensional inputs (Vapnik 1995). This approach is therefore not suitable for all cases, and a computationally cheaper way is needed.

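As noted previously, the SV algorithm depends only on the dot products among patterns. Hence, Boser et al. (1992) observed that it suffices to know the kernel $k(x, x')$ rather than $\Phi$ explicitly,

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle$$

which allows us to rewrite the SVM optimization problem (2.27) as follows

$$\begin{aligned} \text{maximize} \quad & -\tfrac{1}{2} \sum_{i,j=1}^{n} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, k(x_i, x_j) - \varepsilon \sum_{i=1}^{n} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{n} y_i (\alpha_i - \alpha_i^*) \\ \text{subject to} \quad & \sum_{i=1}^{n} (\alpha_i - \alpha_i^*) = 0, \quad \alpha_i,\ \alpha_i^* \in [0, C] \end{aligned} \tag{2.30}$$

Likewise, equations (2.28) and (2.29) can be rewritten as

$$w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, \Phi(x_i) \tag{2.31}$$

and

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^*)\, k(x_i, x) + b \tag{2.32}$$

It should be noted that the difference between the two situations is that in the nonlinear case $w$ is no longer given explicitly, unlike in the linear case. Also note that in the nonlinear setting, the optimization problem corresponds to finding the flattest function in feature space rather than in input space.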
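The observation that only dot products in feature space are needed can be checked numerically. The sketch below uses a homogeneous polynomial kernel of degree 2 on two-dimensional inputs as an assumed example; the feature map `phi` and the two test points are illustrative choices and are not taken from this study.

```python
import math

def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (illustrative choice)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: k(x, z) = <x, z>^2."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))   # dot product computed in the feature space
implicit = poly_kernel(x, z)     # same quantity computed in the input space
```

Both quantities agree, although the kernel never forms the feature vectors. This is what keeps the nonlinear algorithm cheap when the feature space is large.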

2.4.4 The Steps of SVR Algorithm

In this subsection, we illustrate the different steps of the regression algorithm graphically. Figure 2.2 illustrates that the input pattern is first mapped into the feature space by the map $\Phi$ (Schölkopf and Smola 2002). Then, the dot products with the images of the training patterns under $\Phi$ are computed; this corresponds to evaluating the kernel functions $k(x_i, x)$. Next, the dot products are added up using the weights $(\alpha_i - \alpha_i^*)$. Finally, adding the parameter $b$ yields the final prediction output for the regression. The process described above is very similar to Neural Network Regression (NNR), with the difference that, in the SV case, the weights in the input layer are a subset of the training patterns.

Figure 2.2: Architecture of a regression machine constructed by the SV algorithm (Schölkopf and Smola 2002)
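The prediction steps depicted in Figure 2.2 can be sketched in a few lines. This is a minimal illustration that assumes the dual problem has already been solved; the support vectors, the coefficients (standing for the differences of the Lagrange multipliers), the offset and the Gaussian kernel with `gamma=0.5` are all hypothetical values chosen for the example.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an arbitrary illustrative value."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, coefs, b, kernel=rbf_kernel):
    """Evaluate f(x) = sum_i coef_i * k(x_i, x) + b for a trained SV machine."""
    # Mapping and dot products: one kernel evaluation per support vector.
    k_values = [kernel(sv, x) for sv in support_vectors]
    # Add up the kernel values using the weights (alpha_i - alpha_i^*).
    weighted_sum = sum(c * k for c, k in zip(coefs, k_values))
    # Adding the offset b gives the final prediction.
    return weighted_sum + b

# Hypothetical trained model with three support vectors in one dimension.
support_vectors = [(0.0,), (1.0,), (2.0,)]
coefs = [0.5, -0.2, 0.7]
b = 0.1
prediction = svr_predict((1.0,), support_vectors, coefs, b)
```

Because only the support vectors enter the sum, the cost of one prediction grows with their number, which is why sparse solutions are faster to evaluate.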

2.5 Diagnostic Methods

The presence of outliers in a data set is one of the most common problems in real-life applications. According to Hampel et al. (1986), the existence of 1% to 10% of unusual samples in a data set is inevitable. Generally, there is no guarantee that any data set is free from unusual points. Outliers may arise from recording mistakes, a misplaced decimal point, or infrequent phenomena such as earthquakes or floods. An influential observation is an observation that lies far from the rest of the data and has a high impact on the results of the regression analysis (Rousseeuw and Zomeren, 1990). Because of this high impact on the regression estimates, robust regression is an important tool for analyzing data that contain outlying observations. It can be used for detecting unusual observations and for providing stable estimates in their presence.

In general, we can distinguish between two main types of outliers that are responsible for model failure. Outliers with respect to the response variable are called “vertical outliers”, and outliers with respect to the predictors are called “leverage points”. High leverage points are further classified into two groups, good and bad leverage points. Good leverage points lie close to the regression line; they have a small impact on the regression estimators and may even increase the precision of an estimate. Bad leverage points, in contrast, lie far from the regression line and have a large effect on the regression estimators. Figure 2.3 shows the various types of outliers for the simple regression model. The majority of the data points are regular (clean) observations, indicated by (a). Observations (b), (c) and (d) are considered outliers since they deviate from the bulk of the data. Observation (b) is a vertical outlier because it is outlying in the y-direction. Observations (c) and (d) are leverage points because they are outlying in the x-direction; however, (c) is a good leverage point and (d) is a bad leverage point.

Figure 2.3: Classification of observations for simple linear regression (Rousseeuw and Zomeren, 1990)

Diagnostic techniques play an important role in the development and evaluation of multiple regression models, since they are calculated for the purpose of detecting influential observations. In the following section, the most practical diagnostic methods for identification of unusual data points such as vertical outliers and leverage points are reviewed.

2.5.1 Hat Matrix

The hat matrix, denoted by H, plays an important role in identifying unusual points in the predictor space. The H matrix is defined as follows

$$H = X (X^{T} X)^{-1} X^{T}$$

Substituting (2.12) into (2.33) yields the fitted values

$$\hat{y} = X \hat{\beta} = X (X^{T} X)^{-1} X^{T} y = H y$$

The diagonal elements of the H matrix (the hat values) are given by

$$h_{ii} = x_i^{T} (X^{T} X)^{-1} x_i, \quad i = 1, \ldots, n$$

The above equation shows that $h_{ii}$ measures the amount of weight given to $y_i$, relative to all the other values, in determining the fitted value $\hat{y}_i$. The sum of the $h_{ii}$ equals the number of predictors plus one, $p + 1$, while their values fall between 0 and 1. Their average is therefore $(p + 1)/n$. The threshold for $h_{ii}$ is taken as twice or three times this average, $2(p + 1)/n$ or $3(p + 1)/n$, as given by Velleman and Welsch (1981), and any value larger than this threshold is identified as a high leverage point. It should be noted that a large gap between the leverage values of the observations is additional evidence of contaminated data (Kutner et al. 2005). Further, the hat matrix might fail to identify the correct leverage points because it suffers from the masking problem (Hadi 1992).
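The hat-value diagnostic can be sketched with NumPy as follows. The simulated data, the planted leverage point and the function name are assumptions made for illustration; `k` denotes the number of columns of the design matrix (the number of predictors plus one for the intercept).

```python
import numpy as np

def hat_values(X):
    """Diagonal of H = X (X'X)^{-1} X' for a design matrix X that includes the intercept."""
    XtX_inv = np.linalg.inv(X.T @ X)
    # Row-wise quadratic form x_i' (X'X)^{-1} x_i without forming the full n x n matrix.
    return np.einsum("ij,jk,ik->i", X, XtX_inv, X)

rng = np.random.default_rng(0)
x = rng.normal(size=30)
x[0] = 10.0                                   # planted high leverage point (illustrative)
X = np.column_stack([np.ones_like(x), x])     # intercept column plus one predictor

h = hat_values(X)
n, k = X.shape
flagged_twice = np.where(h > 2 * k / n)[0]    # "twice the average" cut-off
flagged_thrice = np.where(h > 3 * k / n)[0]   # "three times the average" cut-off
```

The hat values sum to `k`, so their average is `k / n`, and the planted point exceeds both cut-offs by a wide margin.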

2.5.2 Robust Mahalanobis Distance

In order to clarify the Robust Mahalanobis Distance (RMD), the Mahalanobis Distance (MD) should be introduced first. The Mahalanobis Distance measures how far an observation is from the center of the bulk of the data set (Rousseeuw and Leroy, 1987). For illustration, let $X$ be the $n \times p$ matrix of independent variables in the multivariate regression and let $x_i = (x_{i1}, \ldots, x_{ip})^{T}$ be the ith observation, where n is the sample size and p is the number of predictors. The arithmetic mean $\bar{x}$ and the covariance matrix $C$ are defined as

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad C = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{T}$$

Then the classical $MD_i$ for each $x_i$ is defined as follows

$$MD_i = \sqrt{(x_i - \bar{x})^{T} C^{-1} (x_i - \bar{x})}, \quad i = 1, \ldots, n$$

The computed values are compared with the next threshold

Any observation whose distance exceeds the threshold is considered a leverage point.

The $MD_i$ can also be expressed in terms of the hat values as

$$MD_i^2 = (n - 1) \left( h_{ii} - \frac{1}{n} \right)$$

Since $MD_i$ can be written as a function of $h_{ii}$, it necessarily suffers from the same drawbacks as $h_{ii}$ in the detection of leverage points.
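The link between the classical Mahalanobis distance and the hat values can be verified numerically. The sketch below assumes that the covariance matrix uses the divisor n - 1 and that the hat matrix is computed from a design matrix with an intercept column; the simulated data are illustrative only.

```python
import numpy as np

def mahalanobis_distances(X):
    """Classical Mahalanobis distance of each row of X from the sample mean."""
    center = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)             # sample covariance, divisor n - 1
    diff = X - center
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return np.sqrt(d2)

rng = np.random.default_rng(1)
X = rng.normal(size=(25, 2))                  # illustrative predictors, n = 25, p = 2
md = mahalanobis_distances(X)

# Hat values from the design matrix with an intercept column.
design = np.column_stack([np.ones(len(X)), X])
H = design @ np.linalg.inv(design.T @ design) @ design.T
h = np.diag(H)

n = len(X)
md_from_hat = np.sqrt((n - 1) * (h - 1.0 / n))
```

The two vectors coincide up to rounding, which is why the classical distance inherits the masking problem of the hat values. A robust version would replace the mean and covariance by MVE or MCD estimates, which are not implemented in this sketch.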

On the other hand, the Robust Mahalanobis Distance (RMD) was proposed for the detection of leverage points (Rousseeuw 1985). It uses the same formula as the MD, with the only difference that the location and scatter parameters are estimated by robust methods such as the minimum volume ellipsoid (MVE) or the minimum covariance determinant (MCD). Thus, the Robust Mahalanobis Distance can be written for all $i = 1, \ldots, n$ as

$$RMD_i = \sqrt{(x_i - T_R(X))^{T}\, C_R(X)^{-1}\, (x_i - T_R(X))}$$

where $T_R(X)$ and $C_R(X)$ are the robust location and scatter estimates, respectively, obtained from the MVE or MCD method. It should be noted that this alternative technique can identify the leverage points correctly, but it tends to flag too many regular points as leverage points, which is also undesirable.

2.5.3 Principal Components

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. It is well known in the literature that the first few principal components are sensitive to the presence of outliers in the data set (Barnett and Lewis 1994): outliers can inflate the variances and covariances (and the correlations, if the sample correlation matrix rather than the sample covariance matrix is used in the analysis). For this reason, the use of robust principal components is suggested for detecting outliers (Barnett and Lewis 1994).

In order to illustrate the standard principal components analysis of a sample of p-dimensional data $x_1, \ldots, x_n$, suppose we have

$$z_i = A^{T} (x_i - \bar{x}), \quad i = 1, \ldots, n$$

where $Z$ is the $p \times n$ matrix whose ith column consists of the transformed observation $z_i$, and $A$ is an orthogonal matrix whose columns are the eigenvectors of the sample variance-covariance matrix $S$. Then the $z_i$ are the principal component coordinates, and the ith row of $Z$ gives the projections onto the ith principal component coordinate of the deviations of the n original observations about $\bar{x}$. Thus the top few rows of $Z$ provide the means of investigating the presence of outliers affecting the first few principal components. Scatter diagrams for pairs of the first few principal components can exhibit outliers graphically. Unfortunately, this procedure has no threshold, which makes the decision difficult, especially for non-expert users, in addition to the problems of masking and swamping.
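A minimal sketch of the principal component coordinates is given below. It stores the scores with one row per observation (the transpose of the layout used in the text), and the simulated data with one planted outlier are an assumption made for illustration.

```python
import numpy as np

def principal_component_scores(X):
    """Project the centered observations onto the eigenvectors of the sample covariance."""
    centered = X - X.mean(axis=0)
    S = np.cov(X, rowvar=False)               # sample variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs               # row i holds the coordinates of observation i
    return scores, eigvals

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
X[0] = [8.0, 8.0, 8.0]                        # planted outlier (illustrative)

scores, eigvals = principal_component_scores(X)
first_two = scores[:, :2]                     # coordinates one would plot as a scatter diagram
most_extreme = int(np.argmax(np.abs(scores[:, 0])))
```

In this example the planted point has by far the largest coordinate on the first component, which illustrates both why scatter plots of the leading components expose outliers and why a single outlier can dominate the leading direction.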

2.5.4 The Standard SVM Regression for Outlier Detection

The standard SVR technique for outlier detection (Jordaan and Smits 2004) exploits the Lagrange multipliers obtained by solving (2.30), together with the Karush-Kuhn-Tucker (KKT) conditions. These conditions state that, at the solution, the product between the dual variables and the constraints has to vanish.

If the slack variable of a data point is zero and its Lagrange multiplier does not reach the upper bound ($\alpha_i = C$ or $\alpha_i^* = C$), the point is not suspected to be an outlier. In contrast, data points whose Lagrange multipliers are at the upper bound can be considered candidate outliers. Since, in general, several data points have Lagrange multipliers at the upper bound, it is necessary to find the real outlier among them. After several calculations of the SVR values (2.32) with different values of $\varepsilon$, the candidate most frequently suspected to be an outlier is considered an outlier. This procedure is repeated until no more points are detected as outliers, or until the mean square error (training error) falls below a predetermined threshold.

The following issues emerge when applying this technique to real phenomena. First, the process incurs high computational costs when handling data with multiple outliers, as the detection of each outlier requires numerous iterations of the optimization calculation. Second, it is difficult for non-expert users to operate, as it requires precise detection. Third, according to SVM theory, the SV algorithm has the distinctive property that it constructs its own structure (Chuang et al. 2002), which implies the possibility of reducing masking and swamping issues when various values of the $\varepsilon$ parameter are used.

2.5.5 SVR Based Outlier Detection

TheSVR method for outlier detection utilizes parameter (new regularization parameter) proposed by Nakayama and Yun (2006) to overcome the difficulties of standard approach (Jordaan and Smits 2004) instead of the parameter. TheSVR algorithm (Nishiguchi et al. 2010), has been developed as follows

In primary formation of theSVR method (2.44), only the highest training errors is evaluated, not the average slack variables utilized in the standard SVR (Jordaan and Smits 2004). The Lagrange function is built from (2.44) as follows

The following dual optimization problem results from the partial derivatives of function L, with regard to the primary variables () based on the condition of saddle point.

In this technique, the sum of the Lagrange multipliers is used instead of their values by solving (2.46), with respect to KKT conditions.

When the sum of the Lagrange multipliers (2.47) attains upper bound, all data points with non-zero (positive) Lagrange multipliers have the same maximum error. Consequently, these data points are probably the actual outliers. Based on the optimization concept, when an outlier exists, the point with the highest Lagrange multiplier is deemed the most probable outlier among points, whereby it is far from the majority of the data. From the dual problem (2.46) one can note that all Lagrange multipliers are not constrained by the upper bound. Multiple data points always have a different upper bounded Lagrange multiplier. Moreover, if the sum of the Lagrange multipliers is beneath the upper bound ?, all training errors less than the ? zone, or outliers, do not exist. Therefore, the sum of the Lagrange multipliers can be used for the outlier detection procedure. The ?-?-SVR algorithm for outlier detection can be stated as follows

Step 1: Calculate ?-? –SVR.

Step 2: Find the highest Lagrange multiplier.

Step 3: Remove the data point (input and response) that has the highest Lagrange multiplier from the data set.

Step 4: Repeat the process until there are no more outliers.

The following drawbacks appear when applying this method to real world applications. First, this approach is suitable for few outliers in the data, as it can detect and remove only one outlier per iteration. Thus, computational costs become close to those arising from the standard SVR method, when outliers are increased. Second, although it comes with fixed tolerance ?, which leads to a decline in masking and swamping problems, there is no clear rule for choosing the value of ? parameter.

2.6 Introduction to Robust Estimators

Besides diagnostic measures, the other way to remedy the problem of outliers in linear regression is robust estimation. It is well known that classical linear regression methods such as OLS can be very sensitive to outlying observations (Anderson and Schumacker 2003). For this situation, several alternative robust regression estimators have been proposed, which attempt to down-weight or ignore the outlying observations. The major purpose of robust techniques is to provide useful information even if some of the assumptions are violated, since all estimation methods depend on basic assumptions for their validity. In linear regression analysis, robust regression methods are used to produce resistant estimates, which lead to stable results in the presence of unusual data points (Huber, 1964; Hampel, 1974; Rousseeuw and Leroy, 1987; Wilcox, 2005; Maronna et al., 2006). In this section, the basic concepts of robust regression approaches are briefly presented.

2.6.1 Basic Concepts

The main purpose of introducing the robust regression technique is to provide resistant estimates in the presence of outliers in the data set. The most fundamental properties used to measure the performance of robust methods in the theoretical sense are efficiency, breakdown point and bounded influence. These concepts are summarized briefly as follows.

2.6.1.1 Efficiency

Efficiency measures how well a robust method performs relative to the OLS method when all the basic assumptions are met. It is expressed as a percentage and can be computed as the mean squared error of the OLS fit on the clean data (without outliers) divided by the mean squared error of the robust fit (Maronna et al. 2006). Simpson (1995) pointed out that an efficiency of about 90% to 95% relative to OLS on clean data is desirable.

2.6.1.2 Breakdown Point

Another desirable feature of a robust method is a high breakdown point. The breakdown point (BP) is the smallest percentage of contamination that can completely destroy an estimator, that is, cause the estimation method to break down (Hampel, 1974; Coakley and Hettmansperger, 1993). In other words, it is the smallest fraction of bad observations (outliers) that can change an estimator arbitrarily. A high breakdown point means that the estimator can withstand a large percentage of outliers without breaking down. The highest possible BP is 0.50, because the estimate remains bounded when less than 0.50 of the data are replaced by outlying observations (Rousseeuw and Croux, 1993). In order to introduce a formal finite-sample definition of breakdown, consider a sample of n data points

$$Z = \{ (x_{11}, \ldots, x_{1p}, y_1), \ldots, (x_{n1}, \ldots, x_{np}, y_n) \}$$

If $T$ is a regression estimator, applying $T$ to such a sample yields a vector of regression coefficients

$$T(Z) = \hat{\beta}$$

Now replace any m of the original data points by arbitrary values (outliers) to obtain all possible corrupted samples $Z'$. The breakdown point of the estimator $T$ at the sample $Z$ is then defined as

$$\varepsilon_n^{*}(T, Z) = \min \left\{ \frac{m}{n} : \sup_{Z'} \| T(Z') - T(Z) \| = \infty \right\}$$

where the supremum is over all possible corrupted samples $Z'$ that contain $n - m$ original observations and $m$ contaminated points (Rousseeuw and Leroy, 1987; Maronna et al. 2006).
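The idea of replacing m observations by arbitrary values can be illustrated with two simple location estimators instead of a regression estimator, which is an assumption made here only to keep the example short. The sample values and the contaminating value 1e9 are arbitrary.

```python
import statistics

clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7]   # illustrative sample, n = 7

def contaminate(sample, m, value):
    """Replace the first m observations by an arbitrary value."""
    return [value] * m + sample[m:]

# One replaced observation (m = 1) already carries the mean away.
one_bad = contaminate(clean, 1, 1e9)
mean_shift = abs(statistics.mean(one_bad) - statistics.mean(clean))
median_shift = abs(statistics.median(one_bad) - statistics.median(clean))

# The median stays bounded for m = 3 (fewer than half) and breaks down for m = 4.
median_shift_3 = abs(statistics.median(contaminate(clean, 3, 1e9)) - statistics.median(clean))
median_shift_4 = abs(statistics.median(contaminate(clean, 4, 1e9)) - statistics.median(clean))
```

The mean behaves like OLS, with a breakdown point of 1/n, whereas the median tolerates contamination of almost half of the sample.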

2.6.1.3 Bounded Influence Function

The bounded influence function (BIF) property refers to an estimator that is robust in the X space, i.e., against high leverage points (Simpson, 1995). In other words, BIF is the ability to resist the effect of outlying points in the X space on the model estimators. The influence function (IF) measures the robustness of an estimator with respect to low contamination levels and is commonly used to establish whether or not the estimator has a bounded influence. The influence function of an estimator $T$ at a distribution $F$, at those points $x$ of the sample space where the limit exists, is given by

$$IF(x; T, F) = \lim_{t \to 0} \frac{T\big((1 - t)F + t\Delta_x\big) - T(F)}{t}$$

where $\Delta_x$ is the probability distribution that puts all its mass at the point $x$ and $t$ is the amount of contamination. The influence function thus reflects the bias caused by adding a few outliers at the point $x$ (Rousseeuw and Leroy, 1987; Simpson, 1995; Wilcox, 2005; Maronna et al., 2006).

2.7 Robust Linear Regression

In this section, various robust regression techniques are discussed as alternatives to the classical methods when one or more of the basic assumptions are violated. Robust regression estimates have the advantage of being more efficient, stable and resistant in the presence of outliers. We briefly illustrate how these robust techniques were developed to overcome the drawbacks of non-robust classical methods such as OLS (Rousseeuw and Leroy, 1987; Wilcox, 2005; Maronna et al., 2006; Andersen, 2008).

2.7.1 M-Estimator

The class of M-estimators, proposed by Huber (1973), is one of the most popular robust techniques for the linear regression model. The advantage of the M technique is that it reduces the effect of abnormal points in the data by replacing the sum of squared errors used in OLS with another function. The M-estimator is determined by minimizing the following function of the residuals

$$\min_{\beta} \sum_{i=1}^{n} \rho(e_i) = \min_{\beta} \sum_{i=1}^{n} \rho\left( y_i - x_i^{T} \beta \right) \tag{2.52}$$

where $\rho$ is a function that determines the contribution of each residual to the objective function. The function $\rho$ should satisfy some conditions: it is non-negative ($\rho(e) \ge 0$), symmetric ($\rho(e) = \rho(-e)$), and has a unique minimum at zero ($\rho(0) = 0$).

Since the solution of the M-estimator is not scale equivariant, the minimization problem is modified by dividing the argument of the $\rho$-function by a robust estimate of scale $\hat{\sigma}$,

$$\min_{\beta} \sum_{i=1}^{n} \rho\left( \frac{y_i - x_i^{T} \beta}{\hat{\sigma}} \right) \tag{2.53}$$

Taking the first partial derivative of (2.53) with respect to $\beta$ and setting the result equal to 0, we get

$$\sum_{i=1}^{n} \psi\left( \frac{y_i - x_i^{T} \beta}{\hat{\sigma}} \right) x_i = 0 \tag{2.54}$$

where $\psi$ is the first derivative of $\rho$, which is called the influence function. This function is used to give a certain weight to each residual. The next step is to replace the $\psi$ function in (2.54) by suitable weights that reduce the impact of large residuals, as follows

$$\sum_{i=1}^{n} w_i \left( y_i - x_i^{T} \beta \right) x_i = 0, \qquad w_i = \frac{\psi(u_i)}{u_i} \tag{2.55}$$

The Iteratively Reweighted Least Squares (IRLS) method can be used to solve the non-linear equation in (2.55) until the estimated parameters converge (Beaton and Tukey 1974; Maronna et al., 2006):

$$\hat{\beta} = (X^{T} W X)^{-1} X^{T} W y \tag{2.56}$$

where $W$ is a diagonal weight matrix with elements $w_i = \psi(u_i)/u_i$, and the standardized residuals are $u_i = e_i / \hat{\sigma}$.

Several objective functions, influence functions and weight functions are available in the literature, such as the Huber, Tukey and Hampel functions (Huber, 1964; Hampel, 1974; Andrews, 1974; Maronna et al., 2006). For instance, the Huber functions (Huber 1964) have good properties and can be written as

where is a tuning constant and is an indicator function defined as

Another example of is the Tukey bisquare functions, which can be given as follows

where is a tuning constant and is an indicator function, defined as

It should be noted that the M-estimator is robust to outlying observations in the y-space but is sensitive to high leverage points in the x-space (Huber, 2011). It has about 95% relative efficiency and is highly robust when the outlying observations are in the y-space, whereas it has a breakdown point of 0 when the outlying observations are in the x-space.
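The IRLS iteration for an M-estimator can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from this study: the weights are written in their usual piecewise form with the conventional tuning constants 1.345 (Huber) and 4.685 (Tukey bisquare), the scale is the normalized median absolute deviation re-estimated in every iteration, OLS provides the starting values, and the data are simulated with four vertical outliers.

```python
import numpy as np

def huber_weight(u, c=1.345):
    """Huber weight: 1 inside [-c, c], c/|u| outside."""
    abs_u = np.abs(u)
    return np.where(abs_u <= c, 1.0, c / np.maximum(abs_u, 1e-12))

def tukey_weight(u, c=4.685):
    """Tukey bisquare weight: (1 - (u/c)^2)^2 inside [-c, c], 0 outside."""
    return np.where(np.abs(u) <= c, (1.0 - (u / c) ** 2) ** 2, 0.0)

def irls(X, y, weight_fn=huber_weight, max_iter=50, tol=1e-8):
    """Iteratively reweighted least squares for an M-estimator of regression."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]            # OLS starting values
    for _ in range(max_iter):
        resid = y - X @ beta
        # Robust scale: normalized median absolute deviation of the residuals.
        scale = 1.4826 * np.median(np.abs(resid - np.median(resid)))
        w = weight_fn(resid / max(scale, 1e-12))
        XtW = X.T * w                                      # X' W with W = diag(w)
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)       # weighted least squares step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 40)
y = 2.0 + 0.5 * x + rng.normal(scale=0.2, size=x.size)
y[:4] += 15.0                                              # four vertical outliers
X = np.column_stack([np.ones_like(x), x])

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_huber = irls(X, y)
```

With outliers only in the y-direction, the Huber fit stays close to the generating coefficients (2.0, 0.5), while the OLS fit is pulled towards the contaminated points. Passing `weight_fn=tukey_weight` gives the bisquare variant, which in practice is usually started from a high-breakdown fit rather than from OLS.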

2.7.2 GM1-estimator

Generalized M-estimators (GM-estimators, also known as GM1) were introduced because of the vulnerability of M-estimators to leverage points (Hampel et al. 1986; Hill 1977). The basic idea behind the GM1-estimators is to limit the influence of leverage points, which is the main shortcoming of M-estimators. This is done by using weight functions that reduce the effect of leverage points. The GM1-estimator is the solution of the following normal equation

where is an initial weight function used to minimize the effects of leverage points. The Equation in (2.65) can be solved by using the IRLS technique. Then, the GM1-estimator in convergence can be written as

where is a diagonal weight matrix with elements defined as

The initial weights of the GM1-estimator, which reduce the effect of leverage points in (2.65), are computed from the hat values $h_{ii}$ as

It should be noted that the GM1-estimator has the bounded influence property and still possesses the same efficiency of 95% and the same asymptotic distributional properties as the M-estimator. Unfortunately, its breakdown point is no higher than 1/p, which means that the breakdown point is inversely proportional to the number of independent variables (Simpson 1995). Hence, as the dimensionality increases, the breakdown point approaches zero. Furthermore, weighting by the hat values is not a very effective way to reduce the effect of leverage points in the X direction, because leverage points might not show up in the corresponding diagonal elements when there are several of them (Rousseeuw and Leroy, 1987). Many procedures have been suggested to remedy these drawbacks of GM1-estimators. Perhaps the most important of these is the GM6-estimator.

2.7.3 GM6-estimator

To overcome the limitations of the GM1-estimator, the GM6-estimator was developed by Coakley and Hettmansperger (1993). The general procedure for GM6 is to choose a good initial estimator such as LTS and to apply several stages to achieve desirable properties. The initial weights of the GM6-estimator, which reduce the impact of leverage points in (2.65), are computed using the RMD values based on MCD or MVE, as follows

The algorithm of the GM6-estimator is summarized in the next few steps

Step 1: Use the LTS method as an initial estimator to achieve a high breakdown of 50%, and calculate the residuals ().

Step 2: Calculate the estimated scale of the residuals (from step 1), by applying 1.4826 (the median of the largest (n-p) of the).

Step 3: Using the estimated residuals (), the initial weight () and the estimated scale (), to compute the standardized residuals (), where,

Step 4: Use the standardized residuals () in first iteration Weighted Least Squares (WLS) to estimate the parameters of the regression based on Equation in (2.66), where the weight is chosen based on Huber or any weight function.

Step 5: Calculate the new residuals and new weights from Step 4.

Step 6: The scale estimate from Step 2 is kept fixed, and Steps 3-5 are repeated until convergence.

Based on the algorithm above, the GM6-estimator is more robust than GM1 because it depends on LTS as an initial estimate, whereas GM1 depends on OLS. Further, GM6 is more efficient than GM1 because it uses the RMD based on MCD or MVE to determine the weights, instead of the hat values used by GM1 (Andersen, 2008).

2.7.4 MM-Estimator

The MM-estimator (Yohai 1987) is commonly used in robust regression; it attempts to retain the robustness and resistance of the S-estimator while gaining the efficiency of the M-estimator. It combines a high breakdown point of 0.5 with a high efficiency of about 95% relative to OLS when the basic assumptions are met. In the MM technique, the M-estimation process is applied twice to find the final estimates. As in the previous methods, the IRLS technique is used to obtain the final MM-estimator. The MM-estimation procedure is summarized as follows

Step 1: Use a high-BP estimator such as the S-estimator to calculate the initial residuals.

Step 2: Compute the M-estimate of the scale of the residuals calculated in Step 1.

Step 3: Use the residuals and the scale estimate in the first iteration of WLS to find the M-estimates of the regression parameters, as in (2.56).

Step 4: Compute new residuals and new weights from Step 3.

Step 5: The scale estimate from Step 2 is kept fixed, and Steps 3-4 are repeated until convergence.

2.8 Estimation of Standard Error Using Bootstrap Technique

The bootstrap is a nonparametric technique that uses resampling to build a sampling distribution and to carry out statistical inference for estimators (Efron 1992). The main idea of this approach is to treat the sample as a population and to repeatedly draw new samples from it with replacement. All original observations in the sample have the same probability of being drawn into a new sample. For each newly drawn sample we can then compute the statistics of interest, such as the mean, the variance and the regression coefficients. An attractive feature of this technique is that it can be used to compute the standard errors of robust estimates, whose sampling distributions are difficult to derive. There are two procedures for bootstrapping robust regression coefficients: random-X bootstrapping and fixed-X bootstrapping. The latter is used in this thesis because we deal with fixed predictors in the regression models (Kutner et al., 2005; Andersen, 2008).

2.8.1 Random-X Bootstrapping

The random-X bootstrapping procedure is used to estimate the standard errors of the regression parameters. Suppose the regression model has a dependent variable $y$ and predictors $x_1, \ldots, x_p$, with a sample of n observations $z_i = (y_i, x_{i1}, \ldots, x_{ip})$, $i = 1, \ldots, n$. In random-X bootstrapping, we simply draw J bootstrap samples of the $z_i$ with replacement, estimate the model on each bootstrap sample, and keep the coefficients from each. The bootstrap estimates of the regression parameters and their standard errors are then calculated as follows
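A minimal sketch of random-X bootstrapping is shown below. For self-containment it uses OLS as the estimator, whereas a robust estimator could be passed in through the `estimator` argument. The number of replications J = 500, the divisor J - 1 in the standard deviation and the simulated data are assumptions made for this illustration.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients; stands in for any regression estimator."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def random_x_bootstrap(X, y, estimator=ols, J=500, seed=0):
    """Resample whole observations (y_i, x_i) with replacement and refit J times."""
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs = np.empty((J, X.shape[1]))
    for j in range(J):
        idx = rng.integers(0, n, size=n)          # row indices drawn with replacement
        coefs[j] = estimator(X[idx], y[idx])
    # Bootstrap estimate: mean over replications; standard error: their standard deviation.
    return coefs.mean(axis=0), coefs.std(axis=0, ddof=1)

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, size=60)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=60)
X = np.column_stack([np.ones_like(x), x])

boot_mean, boot_se = random_x_bootstrap(X, y)
```

Because complete rows are resampled, the design matrix changes from one bootstrap sample to the next, which is the sense in which the predictors are treated as random.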

2.8.2 Fixed-X Bootstrapping

In this subsection, we use the fixed-X bootstrapping procedure to estimate the standard errors of the regression parameters, since we deal with fixed predictors. The following steps summarize the algorithm of the fixed-X bootstrapping procedure (Wilcox, 2005)

Step 1: Fit the robust regression model and find the fitted values $\hat{y}_i$ and the corresponding residuals $e_i$.

Step 2: Resample the residuals calculated in Step 1 with replacement to select J random samples of size n, called the bootstrap samples, $e^{*}_{j} = (e^{*}_{1j}, \ldots, e^{*}_{nj})$, where $j = 1, \ldots, J$.

Step 3: Add the resampled residuals to the fitted values to obtain the J sets of bootstrap responses, $y^{*}_{ij} = \hat{y}_i + e^{*}_{ij}$.

Step 4: Regress each of the J sets of bootstrap responses on the fixed model matrix X to obtain J sets of regression parameters. The bootstrap estimate of the regression parameters is calculated based on Eq. (2.70).

Step 5: The bootstrap standard errors of the regression parameters are computed based on Eq. (2.71).
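The five steps above can be sketched as follows. As a simplifying assumption, OLS stands in for the robust regression fit so that the example is self-contained; the number of replications, the divisor J - 1 and the simulated data are likewise illustrative.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients; stands in for the robust fit of Step 1."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fixed_x_bootstrap(X, y, estimator=ols, J=500, seed=0):
    """Residual (fixed-X) bootstrap: the model matrix X is kept fixed throughout."""
    rng = np.random.default_rng(seed)
    beta_hat = estimator(X, y)                    # Step 1: fit the model
    fitted = X @ beta_hat
    resid = y - fitted
    n = len(y)
    coefs = np.empty((J, X.shape[1]))
    for j in range(J):
        e_star = rng.choice(resid, size=n, replace=True)   # Step 2: resample residuals
        y_star = fitted + e_star                           # Step 3: bootstrap responses
        coefs[j] = estimator(X, y_star)                    # Step 4: refit on the fixed X
    # Step 5: mean and standard deviation over the J coefficient vectors.
    return coefs.mean(axis=0), coefs.std(axis=0, ddof=1)

rng = np.random.default_rng(6)
x = np.linspace(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)
X = np.column_stack([np.ones_like(x), x])

boot_mean, boot_se = fixed_x_bootstrap(X, y)
```

Unlike the random-X version, every bootstrap fit uses the same X, so only the error distribution is resampled.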

2.9 The Single Index Model

In the last few years, the single index model (SIM) has become one of the most popular approaches for addressing the problems of high dimensionality and lack of interpretability. It is well known that a parametric model relies on restrictive assumptions and does not adapt easily to high dimensionality. In contrast, a nonparametric model is flexible, but it loses precision as the number of covariates increases (Härdle et al. 2004). To cope with this issue, Ichimura (1993) developed the semi-parametric SIM, whose assumptions are weaker than those of a parametric model and stronger than those of a fully nonparametric model. The SIM relaxes some of the restrictive assumptions of parametric models and avoids some limitations of fully nonparametric models, such as the lack of extrapolation capability and the difficulty of interpretation (Horowitz 2009). In addition, the SIM achieves dimension reduction and avoids the so-called curse of dimensionality, thereby giving greater estimation precision than is possible with fully nonparametric estimation while maintaining the flexibility of a nonparametric model. In short, the SIM combines the interpretability of the parametric model with the flexibility of the nonparametric approach, because the nonparametric part of the model is a function of only one variable. Horowitz (2009) pointed out that the single-index model is often easy to compute and its results are easy to interpret.

To simplify the analysis, the SIM assumes that there is only one factor that causes the systematic risk, and thereby it summarizes the effects of the predictor variables in a single variable called the index (Härdle et al. 2004).

2.9.1 Estimation

For illustration purpose, the next single index model is considered

where is the response variable, is the covariates vector, is the vector of unknown coefficients, is the unknown nonparametric link function.

When estimating the SIM, it should be taken into consideration that the functional form of the link function is unknown. Furthermore, since the shape of the link function $g(\cdot)$ determines the value of the regression function in (2.72), the estimate of the index coefficients has to adapt to a specific estimate of the unknown link function in order to yield a correct regression value. Consequently, in the SIM both the index and the unknown link function have to be estimated, even though only the link function is nonparametric in character.

Let $\varepsilon = y - E(y \mid x)$ be the deviation of the response variable from its conditional expectation. Using $\varepsilon$, the single index model in (2.72) can be rewritten as follows

$$y = g(x^{T}\beta) + \varepsilon. \qquad (2.73)$$

Our aim is to find efficient estimators for $\beta$ and $g(\cdot)$. Since $\beta$ is inside the link function, the challenge is to find a proper estimator for $\beta$, in particular one that attains the $\sqrt{n}$-rate of convergence, the typical rate achieved by parametric estimators. For this purpose, $\beta$ can be estimated iteratively by semi-parametric least squares (SLS) or weighted semi-parametric least squares (WSLS), both proposed by Ichimura (1993).

In general, the estimation procedure for the single index model can be summarized in the following three steps:

Step 1: Estimate the vector of coefficients $\beta$ by $\hat{\beta}$.

Step 2: Calculate the index values $\hat{\eta}_i = x_i^{T}\hat{\beta}$.

Step 3: Use a (univariate) nonparametric technique to estimate the unknown link function $g(\cdot)$ from the regression of $y$ on $\hat{\eta}$.

Once the observations of the new independent variable (the index) have been created, we have a standard univariate regression problem, which can be solved using any convenient nonparametric technique.
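The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: Step 1 is taken as already done (the value of `beta_hat` is hypothetical), the toy data are noise-free, and the Nadaraya-Watson smoother with a Gaussian kernel is only one of many univariate smoothers that could serve in Step 3. All function names are illustrative.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def index_values(X, beta_hat):
    """Step 2: compute the index eta_i = x_i' beta_hat for every observation."""
    return [sum(xj * bj for xj, bj in zip(row, beta_hat)) for row in X]

def nadaraya_watson(eta, y, bandwidth):
    """Step 3: return g_hat, a kernel-weighted local average of y over the index."""
    def g_hat(t):
        weights = [gaussian_kernel((t - e) / bandwidth) for e in eta]
        return sum(w * yi for w, yi in zip(weights, y)) / sum(weights)
    return g_hat

# Toy data (hypothetical): two covariates and a sine link, without noise.
X = [[0.1 * i, 0.05 * (i % 7)] for i in range(40)]
beta_hat = [0.8, 0.6]            # Step 1 is assumed to have produced this value
eta = index_values(X, beta_hat)  # Step 2
y = [math.sin(e) for e in eta]
g_hat = nadaraya_watson(eta, y, bandwidth=0.2)  # Step 3
fitted = [g_hat(e) for e in eta]
```

Whatever the number of covariates, the smoother in Step 3 only ever sees the one-dimensional index, which is how the SIM sidesteps the curse of dimensionality.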

2.9.1.1 Semiparametric Least Squares

In this section, we concentrate on estimating the coefficient vector $\beta$, as indicated in the introduction. The methods considered here under semi-parametric least squares (SLS) share a common idea: establishing a suitable objective function to estimate $\beta$ with the parametric $\sqrt{n}$-rate of convergence. Inside the objective function we use the conditional distribution of the response variable, or the link function, or both (Härdle et al. 2004). As the link function is unknown, it needs to be replaced by a nonparametric estimate. The objective function is then minimized with respect to the parameter vector.

The semi-parametric least squares (SLS) estimator and its weighted version (WSLS) were proposed by Ichimura (1993). We concentrate here on SLS and later generalize it to WSLS using a weight matrix. The least squares objective function can be motivated by minimizing the variation in the data that cannot be explained by the fitted regression. This variation can be written as

$$\mathrm{Var}\{y \mid x^{T}\beta\} = E\left[\{y - E(y \mid x^{T}\beta)\}^{2} \mid x^{T}\beta\right]. \qquad (2.74)$$

Clearly, the right-hand side of the conditional variance in the previous equation suggests the following objective function

$$\min_{\beta}\; E\left[\{y - E(y \mid x^{T}\beta)\}^{2}\right]. \qquad (2.75)$$

Next, a nonparametric procedure is employed to estimate the unknown conditional expectation. As the index is univariate, any consistent univariate smoother can be used. Thus, the SLS estimator that minimizes the sample analogue of the previous objective function is defined as

$$\hat{\beta}_{SLS} = \arg\min_{\beta} \sum_{i=1}^{n} \{y_i - \hat{g}_{\beta}(x_i^{T}\beta)\}^{2}\, I(x_i \in \mathcal{X}), \qquad (2.76)$$

where $I(x_i \in \mathcal{X})$ is the trimming factor and $\hat{g}_{\beta}$ is the estimator of the unknown link function $g$. The trimming factor is introduced to ensure that the density of the index is bounded away from zero; the set $\mathcal{X}$ has to be selected accordingly.

This estimator can be generalized to the weighted version, WSLS, by employing a weight matrix as follows

$$\hat{\beta}_{WSLS} = \arg\min_{\beta} \sum_{i=1}^{n} w(x_i)\,\{y_i - \hat{g}_{\beta}(x_i^{T}\beta)\}^{2}\, I(x_i \in \mathcal{X}), \qquad (2.77)$$

where the weights $w(x_i)$, the diagonal elements of the weight matrix, are used to account for the possible presence of heteroscedasticity.
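As an illustration of the SLS idea, the sketch below evaluates a trimmed sum of squared leave-one-out residuals for candidate coefficient vectors and picks the minimizer by a simple grid search. It rests on several assumptions that are not part of the original method: the link function is replaced by a leave-one-out Nadaraya-Watson estimate with a Gaussian kernel, the trimming set is a hypothetical box in the covariate space, the coefficient vector is normalized to unit length (it is identified only up to scale and sign), and the toy data are noise-free. All names and values are illustrative.

```python
import math
import random

def index_values(X, beta):
    """Index x_i' beta for every observation."""
    return [sum(a * b for a, b in zip(row, beta)) for row in X]

def loo_smoother(eta, y, i, h):
    """Leave-one-out Nadaraya-Watson estimate of E[y | index = eta[i]]."""
    num = den = 0.0
    for j in range(len(y)):
        if j != i:
            w = math.exp(-0.5 * ((eta[i] - eta[j]) / h) ** 2)
            num += w * y[j]
            den += w
    return num / den if den > 0.0 else 0.0

def sls_objective(beta, X, y, h, keep):
    """Trimmed sum of squared leave-one-out residuals for a candidate beta."""
    eta = index_values(X, beta)
    return sum((y[i] - loo_smoother(eta, y, i, h)) ** 2
               for i in range(len(y)) if keep[i])

# Toy data (hypothetical): y = sin(2 * x'beta) with beta = (0.6, 0.8), no noise.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(80)]
beta_true = [0.6, 0.8]
y = [math.sin(2.0 * e) for e in index_values(X, beta_true)]
keep = [max(abs(a), abs(b)) <= 0.9 for a, b in X]  # trimming indicator

# Search over unit vectors (cos t, sin t), t in [0, pi), to fix scale and sign.
grid = [k * math.pi / 60 for k in range(60)]
theta_hat = min(
    grid,
    key=lambda t: sls_objective([math.cos(t), math.sin(t)], X, y, 0.15, keep),
)
beta_hat = [math.cos(theta_hat), math.sin(theta_hat)]
```

A weighted version in the spirit of WSLS would multiply each squared residual by a weight before summing. In practice a numerical optimizer would replace the grid search, which is used here only to keep the sketch short.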

2.10 Variable Selection Methods

Variable selection, also called feature selection or variable subset selection, is the statistical process of selecting a subset of relevant variables (predictors) for use in the construction of a model. It counters the deterioration of generalization ability and reduces the number of original predictors by selecting a significant subset that retains the generalization ability of the full set of inputs. Recently, variable selection has become the focus of much research in fields of application where the data sets consist of hundreds or thousands of variables. These fields include text analysis, gene analysis, and combinatorial chemistry. Moreover, this procedure is used where there are many variables and comparatively few observations (data points), such as the analysis of DNA data, where there are thousands of explanatory variables and only a few tens or hundreds of observations. The basic premise of a variable selection procedure is that the data include some variables that are either redundant or irrelevant, and that can thus be removed without losing too much information (Bermingham et al. 2015). Redundancy and irrelevance are distinct concepts, since a relevant variable may be redundant in the presence of another relevant variable with which it is strongly correlated (Guyon and Elisseeff 2003). In general, three objectives of feature selection techniques can be summarized as follows (Bermingham et al. 2015; Liu and Hu 2013):

Making the models simpler and easier for users or researchers to interpret.

Reducing the training time.

Reducing over-fitting (that is, reducing variance) to enhance the generalization ability of the model.

2.10.1 LASSO Method

The least absolute shrinkage and selection operator (LASSO) proposed by Tibshirani (1996) has received a great amount of attention in the statistics literature. Classical methods such as ordinary least squares may yield estimates with large variance when two or more variables are correlated, which harms the generalization ability of the model. Hoerl and Kennard (1970a, 1970b) proposed Ridge regression as an alternative to the OLS method to overcome this problem. Unfortunately, this technique only shrinks coefficients towards zero and never sets them exactly to zero. LASSO is a variable selection method regarded as an alternative to Ridge regression, since it can set an estimate to exactly zero and thus remove the corresponding variable from the active set of variables. The coefficient vector can be estimated by minimizing the following objective function

$$L(\lambda, \beta) = \sum_{i=1}^{n} (y_i - x_i^{T}\beta)^{2} + \lambda \sum_{j=1}^{p} |\beta_j|. \qquad (2.78)$$

Thus, the LASSO estimate of the coefficient vector can be calculated as follows

$$\hat{\beta}_{LASSO} = \arg\min_{\beta} L(\lambda, \beta), \qquad (2.79)$$

where $\lambda \ge 0$ is the tuning parameter. If $\lambda = 0$, LASSO and OLS give the same estimates, while if $\lambda$ is sufficiently large, LASSO forces some coefficients to be exactly zero, so that, contrary to the Ridge regression method, LASSO identifies the variables with zero coefficients as redundant. However, the LASSO technique has some drawbacks, which can be summarized as follows (Zou and Hastie, 2005):

When p is larger than n, LASSO selects at most n predictors before it saturates, because of the nature of the convex optimization problem. This appears to be a limiting feature for a variable selection method.

If there are very high pairwise correlations among a group of variables, LASSO tends to select only one variable from the group, arbitrarily and without regard to the importance of the variables.

According to Tibshirani (1996), if there are high correlations among the explanatory variables, the prediction performance of LASSO is dominated by that of ridge regression.
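The contrast between LASSO and Ridge regression described in this section (exact zeros versus pure shrinkage) can be seen in the special case of an orthonormal design, where both estimators have closed forms in terms of the OLS coefficients. The sketch below assumes that special case and the penalized forms "residual sum of squares plus lam times the penalty"; the OLS coefficients and the value of `lam` are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t and return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_orthonormal(beta_ols, lam):
    """Minimizer of RSS + lam * sum|b_j| when the design satisfies X'X = I."""
    return [soft_threshold(b, lam / 2.0) for b in beta_ols]

def ridge_orthonormal(beta_ols, lam):
    """Minimizer of RSS + lam * sum(b_j^2) when the design satisfies X'X = I."""
    return [b / (1.0 + lam) for b in beta_ols]

beta_ols = [3.0, -0.4, 0.1, 1.5]   # hypothetical OLS estimates
lam = 1.0
lasso_fit = lasso_orthonormal(beta_ols, lam)
ridge_fit = ridge_orthonormal(beta_ols, lam)
```

With these inputs, LASSO removes the two small coefficients entirely, whereas Ridge regression keeps all four and only scales them down. Setting `lam` to zero returns the OLS coefficients in both cases.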

2.10.2 Elastic Net Method

As mentioned above, the LASSO technique has a drawback that arises when the number of predictors exceeds the number of observations. For illustration, consider the problem of gene selection in microarray data analysis. A microarray data set consists of thousands of variables (genes) and often fewer than 100 samples. Some of these genes share the same biological pathway and therefore form groups within which the correlations can be high. LASSO deals with each such group by selecting only one variable arbitrarily and ignoring the others, without taking into account the importance of the variables. According to Zou and Hastie (2005), the ideal gene selection technique should be able to do two things: eliminate the trivial genes, and automatically include the whole group in the model once one gene among them is selected. When the number of predictors p is larger than the sample size n, LASSO is therefore not the optimal method, since it can choose at most n variables out of the p candidates and it lacks the ability to reveal the grouping information (Efron et al., 2004). To further strengthen the predictive power of LASSO, a new regression shrinkage and selection method, named the Elastic net, has been proposed (Zou and Hastie 2005). The advantage of this method is that it works as well as LASSO whenever LASSO does best, and it can solve the problems highlighted above. Like LASSO, the Elastic net simultaneously achieves automatic variable selection and continuous shrinkage, and in addition it can select groups of correlated variables. The Elastic net penalty function is a combination of the LASSO penalty ($\ell_1$ norm) and the Ridge penalty ($\ell_2$ norm).
Combining these two penalties overcomes a limitation of LASSO, namely its inability to select more predictors than there are observations in the data set. The LASSO penalty tends to generate a sparse model, while the Ridge penalty stabilizes the LASSO regularization path, encourages a grouping effect, and removes the limit on the number of selected variables. The Elastic net method estimates the coefficient vector by minimizing the following objective function

$$L(\lambda_1, \lambda_2, \beta) = \sum_{i=1}^{n} (y_i - x_i^{T}\beta)^{2} + \lambda_2 \sum_{j=1}^{p} \beta_j^{2} + \lambda_1 \sum_{j=1}^{p} |\beta_j|. \qquad (2.80)$$

Thus, the Elastic net estimator can be defined in penalized least squares form as follows

$$\hat{\beta}_{EN} = \arg\min_{\beta} L(\lambda_1, \lambda_2, \beta), \qquad (2.81)$$

where $\lambda_1$ and $\lambda_2$ are fixed non-negative tuning parameters corresponding to the LASSO and Ridge regression penalty functions, respectively, which control the amount of regularization applied to the estimate.

The Elastic net penalty function in (2.80), which combines the Ridge and LASSO penalty functions, is singular at zero (it has no first derivative there) and strictly convex. By Theorem 1 of Zou and Hastie (2005), the estimated coefficients of highly correlated variables tend to be equal. This grouping property motivates treating such variables as a group, even though ridge regression itself is not a variable selection procedure. On the other hand, the second part of the Elastic net (the LASSO penalty) performs variable selection.
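The grouping effect can be illustrated with a small coordinate descent sketch for the penalized criterion "residual sum of squares plus lam2 times the sum of squared coefficients plus lam1 times the sum of absolute coefficients". This is an illustrative implementation under assumptions, not the algorithm of Zou and Hastie (2005): the data are simulated, the first two predictors are exact copies of each other (an extreme case of correlation), and the tuning values are hypothetical.

```python
import random

def soft_threshold(z, t):
    """Shrink z towards zero by t and return exactly 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def naive_elastic_net(X, y, lam1, lam2, sweeps=200):
    """Coordinate descent for RSS + lam2 * sum(b_j^2) + lam1 * sum|b_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            # rho = x_j' (y - fit of all other predictors)
            rho = 0.0
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
            beta[j] = soft_threshold(rho, lam1 / 2.0) / (col_ss[j] + lam2)
    return beta

# Simulated data (hypothetical): predictors 1 and 2 are identical, 3 is noise.
rng = random.Random(1)
x1 = [rng.uniform(-1, 1) for _ in range(30)]
x3 = [rng.uniform(-1, 1) for _ in range(30)]
X = [[a, a, c] for a, c in zip(x1, x3)]
y = [2.0 * a + rng.gauss(0.0, 0.1) for a in x1]

lasso_fit = naive_elastic_net(X, y, lam1=2.0, lam2=0.0)  # LASSO special case
enet_fit = naive_elastic_net(X, y, lam1=2.0, lam2=5.0)   # Elastic net
```

With `lam2 = 0` the fit assigns the whole effect to the first of the two identical predictors and leaves the second at zero, while with `lam2 > 0` the two predictors receive equal coefficients, which is the grouping behaviour described above.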

CHAPTER 3

FIXED PARAMETERS SUPPORT VECTOR REGRESSION FOR OUTLIER DETECTION

3.1 Introduction

The support vector machine (SVM) is currently a very popular technique for outlier detection, as it is a nonparametric model and does not require the data to be of full rank. In order to evaluate the approximate relationship among the variables, it is necessary to detect the outliers that are commonly present in most natural phenomena before constructing the model. The SVM has attracted the interest of many researchers in the machine learning community and has been successfully applied to regression problems (SVR) in addition to classification problems (SVC) (Yang et al. 2004). It is a universal technique for solving nonlinear, rank-deficient and high-dimensional problems (Williams et al. 2011). The SVR employs the kernel trick to transform a nonlinear relationship in the input space into a linear form in a high-dimensional feature space (Lahiri and Ghanta 2009; Üstün et al. 2006). Although SVR is a nonparametric approach, it is still affected by outliers, a common problem in real-life data, because outliers may be selected as support vectors (Chuang et al. 2002).

In real-life applications, samples are typically subject to noise or outliers. Muñoz-Garcia et al. (1990) give the following definition: “An outlier is an observation which being atypical and/or erroneous deviates decidedly from the general behavior of experimental data with respect to the criteria which is to be analyzed on it”. Outliers might occur for several reasons, such as erroneous measurements or a phenomenon that lies in the tail of some distribution. If the samples contain outliers or noise, the learning method may attempt to fit the undesirable data, and the approximating function may go awry. This phenomenon is called “over-fitting” (Chen and Jain 1994; Suykens et al. 2002); it can increase the testing error and often leads to a loss of generalization performance.

Outliers are abnormal observations and should be replaced or removed from the data before constructing the model. As the SVR technique usually deals with nonlinear cases with high-dimensional inputs, classical statistical methods (such as linear regression) might fail to achieve the required performance. Further, detecting multiple outliers with standard diagnostic methods can result in masking and swamping. Masking occurs when outliers are incorrectly identified as normal observations, and swamping occurs when normal observations are incorrectly identified as outliers (Pell 2000).
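The two error types can be made concrete with a short sketch that compares the set of observations flagged by a detection method with the set of true outliers. The index sets below are hypothetical and serve only to illustrate the definitions; the function name is illustrative.

```python
def masking_and_swamping(true_outliers, flagged):
    """Return (masked, swamped) observation indices.

    masked:  true outliers that the method failed to flag
    swamped: normal observations that the method flagged as outliers
    """
    true_outliers, flagged = set(true_outliers), set(flagged)
    masked = sorted(true_outliers - flagged)
    swamped = sorted(flagged - true_outliers)
    return masked, swamped

# Hypothetical example: observations 3, 8 and 15 are the true outliers.
masked, swamped = masking_and_swamping({3, 8, 15}, {3, 8, 11, 19})
# masked == [15], swamped == [11, 19]
```

A method that is free of both problems returns two empty lists.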

Recently, SVM has been applied to outlier detection (Cherkassky and Mulier, 2007). For instance, SVM for classification (SVC) was used to detect outliers, as mentioned by Jordaan and Smits (2004). The robustness of SVM with respect to outliers, together with the fact that outliers are part of the support vector set, makes the technique potentially appropriate for outlier detection (Jordaan and Smits 2004). SVM also holds a significant advantage over classical methods in that its parameters are easy to control. However, the robustness of SVM alone is not sufficient to detect outliers without taking into consideration the number of observations used to build the model and the type of transformation (the kernel function).

Jordaan and Smits (2004) explored the use of SVR for outlier detection, based upon the robustness of SVR. Unfortunately, this technique suffers from some issues when applied to real phenomena. First, the process incurs a high computational cost when handling data with multiple outliers, as the detection of an outlier requires numerous iterations of the optimization. Second, it is difficult for non-expert users to operate, as it requires precise detection. Third, based on SVM theory, the SV algorithm has a unique advantage in that it constructs its own structure (Chuang et al. 2002), implying the possibility of reducing masking and swamping issues by using various values of the ε parameter.

Later, SVR was utilized to detect outliers for nonlinear functions with multi-dimensional inputs (Nishiguchi et al. 2010). This procedure is no better than the standard SVR approach of Jordaan and Smits (2004), and it also suffers from some drawbacks in real-world applications. First, the approach is suitable only when there are few outliers in the data, as it can detect and remove only one outlier per iteration. Thus, as the number of outliers increases, its computational cost approaches that of the standard SVR method. Second, although it comes with a fixed tolerance ε, which reduces the masking and swamping problems, there is no clear rule for choosing the value of ε. Although this method takes advantage of SVR, it is difficult for non-expert users to put into practice, as the computational cost is high and the reduction of masking and swamping depends on tuning the free parameters.

In this chapter, our objective is to overcome these shortcomings of existing SVR methods for outlier detection. The proposed methods introduce fixed parameters for the SVR model and, at the same time, detect all the outliers in the data in a single iteration. We provide SVR for outlier detection that takes into account the three angles of the solution triangle: robustness, sparseness, and the characteristics of kernel functions.

The rest of this chapter is organized as follows. Section 3.2 gives a brief explanation of the proposed methods, namely the fixed parameters ε-tube SV regression for the radial basis function and for the linear kernel function. In Section 3.3, the performance of the proposed methods is evaluated using three real data sets for each method and compared with that of parametric methods such as principal components analysis (PCA) and the robust Mahalanobis distance (RMD) (Barnett and Lewis 1994; Jackson and Chen 2004). In Section 3.4, the performance and reliability of the proposed approaches are tested using low- and high-dimensional simulated data sets. Finally, some concluding remarks about the proposed methods are given in Section 3.5.

3.2 Fixed Parameters SV Regressions

In order to improve the performance of the standard SVR in detecting outliers, we suggest a practical procedure (fixed parameters ε-tube SV regression) that takes into consideration all three angles (the type of transformation, sparseness and robustness). These three angles form the so-called triangle of solution; therefore, any approach that depends on only one of them may not be the best. The efficiency of this technique lies in the fact that it requires less time than conventional methods and can detect abnormal points (outliers) without the need to remove them, so that they can be handled otherwise (for instance, by reducing their weights with robust methods).

In the fixed parameters ε-tube SV regression, we exploit the non-sparseness of the ε-insensitive loss function (the need for all samples). As shown in Ceperic et al. (2014) and Guo et al. (2010), if the value of the threshold ε is very small, the SV regression model depends on most of the training data, which makes the resulting solution non-sparse. When the ε parameter is greater than zero, it is likely that some of the outliers are not selected as support vectors (they fall inside the ε-zone), so that further iterations are needed to detect the outliers correctly. In practice, outliers can be detected by using the non-sparse ε-tube loss function (with the ε parameter equal to zero). The non-sparse ε-tube loss function is defined as follows
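For reference, the effect of setting ε to zero can be sketched with the standard ε-insensitive loss of support vector regression. The notation below ($L_{\varepsilon}$ for the loss and $f(x)$ for the fitted regression function) is an assumption made for illustration and may differ from the notation of this study:

```latex
L_{\varepsilon}\bigl(y, f(x)\bigr) = \max\bigl(0,\; |y - f(x)| - \varepsilon\bigr),
\qquad
L_{0}\bigl(y, f(x)\bigr) = |y - f(x)| .
```

With ε = 0 the insensitive zone disappears, so every observation with a non-zero residual incurs a loss and contributes to the solution, which is what makes the fit non-sparse.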

Then, the convex optimization problem in (2.19) can be rewritten as follows

Consequently, the final regression function of the non-sparse ε-tube SVR and the weight vector can be represented as follows

As demonstrated by Rojo-Álvarez et al. (2003), controlling the free parameters of the SVM (C, ε and the kernel parameter h) provides insensitivity to outliers, or at least reduces their impact on the solution. According to Üstün et al. (2005), the robustness of the SV regression model (3.3) depends mainly on the choice of the C value, because the largest values of the Lagrange multipliers $\alpha_i$ and $\alpha_i^{*}$ are, by definition of the Lagrange procedure, equal to C. More precisely, a very high C produces SVs with a high variance between the $\alpha_i$ and $\alpha_i^{*}$ values, resulting in significant weights. The highest Lagrange multipliers belong to the unusual data points in the training data, which are considered outliers (Jordaan and Smits 2004). The weight vector in (3.3) grows as C increases and in the presence of outliers. In this situation, it is easy to control the impact of outliers through the value of C and the characteristics of the kernel.
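The role of C in this argument can be read off the standard dual form of support vector regression, sketched below for reference. The symbols ($\alpha_i$, $\alpha_i^{*}$ for the Lagrange multipliers, $K$ for the kernel, $\varphi$ for the feature map and $b$ for the offset) follow common textbook usage and are assumptions here; they are not necessarily the exact notation of equation (3.3):

```latex
f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*})\, K(x_i, x) + b,
\qquad
w = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*})\, \varphi(x_i),
\qquad
0 \le \alpha_i,\ \alpha_i^{*} \le C .
```

Because each multiplier is capped at C, the influence of any single observation on the fitted function is bounded by C, and observations whose multipliers sit at this bound are natural candidates for outliers.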

Another point for consideration is the characteristics of the kernel functions. According to Williams (2011), the SVM algorithm is sensitive to the tuning choice (the type of kernel), so it is important to understand how a kernel function works. The data generally follow one of two types of kernel functions, the exponential radial basis function or the linear function; therefore, we introduce two diagnostic methods that together cover both types of data.

3.2.1 Proposed Method for Radial Basis Function (RBF)

The most common type of kernel is the Gaussian Radial Basis, which could be represented by the following equation

where is the explanatory variable, is the fractions of and is the bandwidth kernel function.

Looking at equation (3.4), we note that the term inside the brackets is never positive, meaning that the RBF kernel decreases exponentially from its upper bound, which is equal to one. To explain this, let be an outlier; then the RBF kernel value for the pair () is equal to the upper bound, as demonstrated below:

Hence, it can be concluded that the outlier point affects its own row more than it influences the other points, and we expect the estimated value (3.3) for the outlier point to be the greatest among the estimated values. However, the differences would not be clear when the value of the parameter C is moderate, implying that we would get a low residual. To avoid this, we can use a very high weight (C = 10000) to obtain an extremely high estimated value for the unusual sample. Consequently, using the largest errors would fail to detect the outliers, because the estimated values are close to the real values. As a result, we use the estimated values to detect the outliers.
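The behaviour of the kernel values described above can be checked numerically. The sketch below assumes the common Gaussian form exp(-||x - z||^2 / (2 h^2)) with bandwidth h = 1, which may differ in its scaling from equation (3.4); the four data points are hypothetical, with the last one playing the role of an outlier.

```python
import math

def rbf_kernel(x, z, h=1.0):
    """Gaussian RBF kernel; the 2 * h**2 scaling is an assumed convention."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * h ** 2))

# Hypothetical one-dimensional sample; the last point lies far from the rest.
points = [[1.0], [1.2], [0.9], [9.0]]
K = [[rbf_kernel(p, q) for q in points] for p in points]

outlier_row = K[3]
# The diagonal entry of the outlier's row is the upper bound 1.0, while its
# kernel values with all other points are practically zero.
```

In the kernel matrix, the row of the distant point is therefore close to a unit vector: the point is similar only to itself, whereas the rows of the clustered points contain several values close to one.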

From the graphics, we can easily observe the points that are far from the majority of the data; however, non-expert users may still face some difficulties. Thus, a cut-off criterion should be used. In order to detect the outlier points correctly, we can utilise a robust location parameter (the median) to separate the outliers from the majority of the data. For any single variable, the maximum value of its observations is as follows

In order to separate the outlier points from the clean points, equation (3.6) can be used. However, to use this equation, one needs to estimate the value of its parameter. Two things should be taken into account when estimating this parameter: the dispersion of the observations of the variable, and the value of the penalising parameter C. In this case, the standard deviation of the robust location parameter (the median) of the variable can be used as a predicted value for the parameter when the variable consists of the predicted values. The standard deviation is used here instead of the variance because low errors correspond to high predicted values (Zong et al. 2006). Thus, the cut-off point can be expressed as follows

As this approach detects all the outlier points in a single iteration, its computational cost is lower than that of the conventional techniques. Additionally, it is suitable for non-expert users because it uses a fixed set of parameters. In the experimental results sections, the RBF kernel function (3.4) is used with (h = 1, ε = 0, C = 10000), and the predicted values are used to detect outliers.

3.2.2 Proposed Method for Linear Kernel Function

In this section, we explore the use of the linear kernel function, or the first-degree polynomial kernel, which is represented as follows

It is evident that this type of kernel increases linearly, with a starting point equal to one. This kernel works contrary to the RBF function of the previous section, as its minimum value equals the maximum value of the RBF function. Although it works in a different way, the outlier still affects its own row to a greater extent than the rest of the data, and we expect the predicted value (3.3) for an outlier observation to be near its actual value, implying a small error. To avoid this, we can use an extremely high weight (C = 10000) to obtain very high estimates, so that the highest error corresponds to the outlier's row, and the vector with the highest error is considered an outlier. To detect the outlier accurately we can use the cut-off point (3.9), but with the variance of the median instead of the standard deviation, because we expect soft errors when the parameter C is high (Zong et al. 2006).

As this method detects all outliers in a single application, its computational cost is lower than that of previous methods. In addition, the free parameters are set to fixed values, which makes the method suitable for non-expert users. In the next two sections, we use the linear kernel function (3.8) with (h = 1, ε = 0, C = 10000) and detect outliers from the estimated residuals.

3.3 Experimental Results for Real Data Sets

To show the performance of the proposed outlier detection techniques, we consider three real examples for each method, containing single and multiple outliers. The examples for the RBF method are the copper content in wholemeal flour data, the international Belgian phone calls data, and the Hawkins, Bradu and Kass data. The examples for the linear kernel method are the children's first word data, the cloud point of a liquid data, and the stack loss data. These data sets were selected because numerous outlier detection studies have previously been conducted on them, and there is general agreement on which data points are the outliers (Rousseeuw and Leroy 1987; Williams et al. 2002).

3.3.1 The Copper Content Data

The first example, with two variables, consists of 24 observations of copper content in wholemeal flour, sorted in ascending order. The last observation is considered an outlier, as mentioned in most previous studies (Maronna et al., 2006). This data set is presented in Appendix A1.

Figure 3.1 (a) shows graphically the results of applying the fixed parameters SVR to the data set, based on the estimated values, and Table 3.1 reports the same results numerically. It is clear that the outlier point is detected correctly by the proposed method. On the other hand, Figure 3.1 (b) illustrates the results of applying the PCA and RMD methods. It can be seen that PCA detects the outlier point correctly, while RMD fails to detect it.

Figure 3.1 (a): Detection of outlier based on the proposed method for copper content data

Figure 3.1 (b): Detection of outlier based on the PCA and RMD for copper content data

Table 3.1: The results of applying the proposed method for copper content data (cut-off point: 9.48)

Index   Pred      Index   Pred      Index   Pred
1       2.2000    9       3.0299    17      3.5999
2       2.2001    10      3.0297    18      3.6996
3       2.3999    11      3.1002    19      3.7000
4       2.3998    12      3.3699    20      3.7000
5       2.4994    13      3.4001    21      3.7003
6       2.7001    14      3.3998    22      3.7700
7       2.7997    15      3.4000    23      5.2804
8       2.8997    16      3.5004    24      28.949

3.3.2 Belgian Phone Calls Data

This data set gives the total number of international phone calls in Belgium, recorded for the years 1950 to 1973. Based on several studies, it contains six vertical outliers, which are shown in Appendix A2 (Rousseeuw and Leroy, 1987).

As shown in Figure 3.2 (a), the proposed approach identifies the outliers correctly using the predicted values. The results of applying the proposed method are summarized in Table 3.2. Comparing the predicted values with the cut-off point shows that the outliers are detected correctly (observations 15 to 20 are flagged as outliers). In contrast, Figure 3.2 (b) is based on the PCA and RMD methods. It is clear that both of these techniques fail to detect the outliers correctly. Furthermore, the PCA method suffers from masking and swamping simultaneously, since it treats outliers as normal points and normal points as outliers.

Figure 3.2 (a): Detection of outliers based on the proposed method for phone calls data

Table 3.2: The results of applying the proposed method for phone calls data (cut-off point: 64.5)

Index   Pred      Index   Pred      Index   Pred      Index   Pred
1       4.3998    7       8.0997    13      16.099    19      181.99
2       4.7003    8       8.7999    14      21.199    20      211.99
3       4.6996    9       10.600    15      118.99    21      42.999
4       5.8999    10      12.000    16      124.00    22      24.000
5       6.5998    11      13.499    17      141.99    23      26.999
6       7.3002    12      14.900    18      159.00    24      29.000
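The comparison with the cut-off point can be reproduced directly from the values reported in Table 3.2. The short sketch below only applies the reported cut-off (64.5) to the reported predicted values; it does not recompute either of them, and the variable names are illustrative.

```python
# Predicted values from Table 3.2, in order of observation index 1..24.
predicted = [
    4.3998, 4.7003, 4.6996, 5.8999, 6.5998, 7.3002,
    8.0997, 8.7999, 10.600, 12.000, 13.499, 14.900,
    16.099, 21.199, 118.99, 124.00, 141.99, 159.00,
    181.99, 211.99, 42.999, 24.000, 26.999, 29.000,
]
cut_off = 64.5  # cut-off point reported in Table 3.2

# Flag every observation whose predicted value exceeds the cut-off point.
flagged = [i + 1 for i, value in enumerate(predicted) if value > cut_off]
```

The flagged indices are observations 15 to 20, in agreement with the six vertical outliers reported for this data set.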

Figure 3.2 (b): Detection of outliers based on the PCA and RMD for phone calls data

3.3.3 Hawkins, Bradu and Kass Data (HBK)

The last example, which has multiple variables, is the HBK data set, an artificially constructed data set with 10 bad leverage points (the first 10 observations), which affect the regression line and lie far away from it. Observations 11–14 are considered good leverage points that lie near the regression line (Hawkins, 1980). This data set is presented in Appendix A3.

As seen in Figure 3.3 (a), the proposed method succeeded in detecting the outlier points of the data set correctly using the estimated values of the response variable. Moreover, the proposed method detects the bad leverage points while treating the good leverage points as normal points, which means that no degrees of freedom are lost unnecessarily. This advantage can be seen clearly for the good leverage points 11–14, which lie with the majority of the data. In contrast, we can observe from Figure 3.3 (b) that both the PCA and RMD methods succeeded in detecting the bad leverage points (1–10). Unfortunately, they also flagged the good leverage points 11–14 as outliers, which means losing four degrees of freedom unnecessarily. For detailed results, Table 3.3 reports the outcome of applying the proposed approach for outlier detection based on the predicted values of the dependent variable.

Figure 3.3 (a): Detection of outliers based on the proposed method for HBK data

Table 3.3: The results of applying the proposed method for HBK data

Index   Pred      Index   Pred      Index   Pred      Index   Pred
1       9.6998    20      0.4000    39      0.7000    58      0.1000
2       10.099    21      0.9004    40      0.4997    59      0.2996
3       10.299    22      0.2998    41      0.1002    60      0.9000
4       9.4996    23      0.8000    42      0.7003    61      0.3000
5       10.000    24      0.6998    43      0.6001    62      0.5995
6       9.9999    25      0.3003    44      0.7000    63      0.2999
7       10.800    26      0.8002    45      0.4999    64      0.4995
8       10.299    27      0.6999    46      0.3996    65      0.5997
9       9.5997    28      0.2999    47      0.9003    66      0.9004
10      9.8999    29      0.2998    48      0.0998    67      0.6998
11      0.1998    30      0.2995    49      0.8998    68      0.5996
12      0.3999    31      0.0002    50      0.3997    69      0.2004
13      0.6997    32      0.4003    51      0.7003    70      0.6998
14      0.1000    33      0.6000    52      0.5001    71      0.1999
15      0.3996    34      0.7001    53      0.7001    72      0.1999
16      0.6001    35      0.2996    54      0.6998    73      0.4004
17      0.2002    36      0.9997    55      0.0003    74      0.8995
18      0.0002    37      0.6000    56      0.1002    75      0.2001
19      0.1004    38      0.9003    57      0.6999

Figure 3.3 (b): Detection of outliers based on the PCA and RMD for HBK data

3.3.4 First Word-Gesell Data

The first example for the linear kernel method is the first word and Gesell adaptive score data. These data come from Mickey et al. (1967) and have been extensively cited (Rousseeuw and Leroy 1987). The data cover 21 children; the independent variable is the age (in months) at which a child utters its first word, and the response variable is the child's Gesell adaptive score. According to Mickey et al. (1967), observation 19 is an outlier. This data set is given in Appendix A4.

The proposed approach for the linear kernel function was applied to the first word data, as the data follow this type of kernel. As shown in Figure 3.4 (a) and Table 3.4, the proposed method detects the outlier correctly using the estimated residuals. Figure 3.4 (b) shows the results of applying the PCA and RMD methods to this data set. It can be seen clearly that both the PCA and RMD methods succeeded in detecting the outlier, but the latter suffers from swamping: it flags five observations as outliers although they are normal points.

Figure 3.4 (a): Detection of outliers based on the proposed method for First word-Gesell data

Table 3.4: The results of applying the proposed method for First word-Gesell data (cut-off point: 22.96)

Index   Resid     Index   Resid     Index   Resid
1       1.1818    8       0.7272    15      2.7272
2       7.8181    9       0.6363    16      0.6363
3       17.636    10      7.0000    17      7.0909
4       11.000    11      8.2727    18      0.9600
5       8.1818    12      6.0000    19      29.909
6       0.9025    13      17.636    20      13.272
7       3.2727    14      15.272    21      0.6363

Figure 3.4 (b): Detection of outliers based on the PCA and RMD for First word-Gesell data

3.3.5 Cloud Point Data

The next example concerns the cloud point of a liquid, which is a measure of the degree of crystallization in a stock and can be measured via the refractive index. The data come from Draper and Smith (1966), and numerous studies agree that they contain only three outliers: observations 1, 10 and 16. This data set is given in Appendix A5. According to the results illustrated in Figure 3.5 (a), the proposed method with the linear kernel function finds the outliers correctly using the estimated residuals. The details are summarized in Table 3.5, where only observations 1, 10 and 16 exceed the threshold. In contrast, as shown in Figure 3.5 (b), the PCA method detects only two of the outliers and misses the third. It also flags two normal observations, 15 and 19, as outliers, which means that it suffers from swamping in addition to masking. The RMD method fails completely to detect the correct outliers.

Figure 3.5 (a): Detection of outliers based on the proposed method for Cloud Point data

Table 3.5: The results of applying the proposed method for Cloud Point data

Index Resid (cut-off 0.739) Index Resid (cut-off 0.739) Index Resid (cut-off 0.739)

1 2.0667 8 0.1833 15 0.2333

2 0.5833 9 0.0999 16 1.3667

3 0.0005 10 2.2666 17 0.3833

4 0.1166 11 0.0999 18 0.1333

5 0.3666 12 0.6666 19 0.6166

6 0.1500 13 0.6333

7 0.3333 14 0.0005

Figure 3.5 (b): Detection of outliers based on the PCA and RMD for Cloud Point data

3.3.6 Stack Loss Data

The last example in this section is the stack loss data (Rousseeuw and Leroy 1987). The data describe the operation of a plant that oxidizes ammonia to nitric acid and consist of 21 observations, measured once a day, on three explanatory variables. In many studies, points 1, 3, 4, and 21 are considered outliers (Rousseeuw and Leroy 1987). This data set is given in Appendix A6.

The stack loss data are used to explore the effectiveness of the fixed parameter SVR for outlier detection in comparison with classical methods. The results of the proposed method are summarized in Figure 3.6 (a) and Table 3.6. Figure 3.6 (a) shows that the proposed method detects all of the outliers in a single application, using the threshold that separates outliers from the majority of the data. Only observations 1, 3, 4 and 21 have residuals greater than the cut-off point, so they are classified as outliers. In contrast, Figure 3.6 (b) shows that the PCA fails to detect the correct outliers, while the RMD detects three of the four outliers. For this example, we can conclude that the PCA and RMD suffer from both masking and swamping problems.

Figure 3.6 (a): Detection of outliers based on the proposed method for Stack loss data

Figure 3.6 (b): Detection of outliers based on the PCA and RMD for Stack loss data

Table 3.6: The results of applying the proposed method for Stack loss data

Index Resid (cut-off 3.426) Index Resid (cut-off 3.426) Index Resid (cut-off 3.426)

1 5.0719 8 0.0121 15 1.1792

2 0.0120 9 1.4730 16 0.0005

3 5.4389 10 0.0005 17 0.4177

4 7.6283 11 0.5387 18 0.0016

5 1.2148 12 0.0572 19 0.4831

6 1.7932 13 2.8806 20 1.6226

7 1.0121 14 1.7999 21 9.4588

3.4 Artificial and Simulation Studies

We consider three simulation studies here. The first tests the proposed method for outlier detection in high dimensional data. The second evaluates the performance of the proposed method in detecting outliers in rank deficient data. The last checks the reliability of the proposed method in detecting the correct number of outliers, taking the problems of masking and swamping into consideration. In this last study, the proposed method is compared with the RMD method only, because the PCA has no cut-off point, which makes it difficult to separate outliers from the majority of the data across the replications.

3.4.1 First Artificial Data

To illustrate the performance of the proposed method, we consider the general linear regression model in (2.3) with n = 25 observations and p = 10 predictors. The observations of the explanatory variables are generated from a uniform distribution U(1, 2), while the error term is sampled from a normal distribution N(0, 1). Two vertical outliers and two leverage points are created by replacing four observations (1, 5, 10, and 20) with values drawn from a normal distribution N(35, 1). The first two points, 1 and 5, are vertical outliers in the Y-direction, and the other two points, 10 and 20, are bad leverage points in both the X and Y variables. The proposed method with the linear kernel function is applied and compared with the PCA and RMD methods.
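The design described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the regression coefficients (all set to 1, with no intercept), the reading of the N(0, 1) draw as the error term, and the function name `make_first_artificial_data` are ours and are not given in the text.

```python
import random


def make_first_artificial_data(n=25, p=10, seed=0):
    """Simulate the contaminated design described in the text (1-based indices)."""
    rng = random.Random(seed)
    beta = [1.0] * p  # hypothetical coefficients; the source does not state them
    X = [[rng.uniform(1.0, 2.0) for _ in range(p)] for _ in range(n)]
    y = [sum(b * v for b, v in zip(beta, row)) + rng.gauss(0.0, 1.0) for row in X]

    for idx in (1, 5):    # vertical outliers: only the response is replaced
        y[idx - 1] = rng.gauss(35.0, 1.0)
    for idx in (10, 20):  # bad leverage points: predictors and response replaced
        X[idx - 1] = [rng.gauss(35.0, 1.0) for _ in range(p)]
        y[idx - 1] = rng.gauss(35.0, 1.0)
    return X, y


X, y = make_first_artificial_data()
print(len(X), len(X[0]))
```

Rows 10 and 20 of `X` then lie far from the U(1, 2) bulk, while observations 1 and 5 are outlying in the response only.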

Figure 3.7 (a) and Table 3.7 present the outlier detection results of the proposed method with the linear kernel function. All outliers (vertical outliers and leverage points) are detected correctly, without swamping or masking. On the other hand, Figure 3.7 (b) indicates that the PCA method detects neither the vertical outliers nor the leverage points. The RMD fails to identify the vertical outliers and detects only one of the two leverage points. Furthermore, it flags the normal points 6, 8, 11, 15 and 19 as outliers. We can conclude from this example that both the PCA and RMD methods are unsuitable for high dimensional data, especially the latter, since its threshold depends on the number of predictor variables p.

Table 3.7: The results of applying the proposed method for first artificial data

Index Resid (cut-off 12.8) Index Resid (cut-off 12.8) Index Resid (cut-off 12.8)

1 31.656 10 20.491 19 1.1712

2 1.5217 11 0.0091 20 19.858

3 1.6049 12 0.0194 21 0.0462

4 1.0394 13 0.0138 22 0.1743

5 30.710 14 1.1499 23 1.1176

6 0.0169 15 0.0063 24 0.0176

7 0.4823 16 0.0288 25 0.0161

8 0.7589 17 1.6510

9 0.0308 18 0.0054

Figure 3.7 (a): Detection of outliers based on the proposed method for first artificial data

Figure 3.7 (b): Detection of outliers based on the PCA and RMD for first artificial data

3.4.2 Second Artificial Data

In order to evaluate the performance of the proposed method for rank deficient data, this example, which is based on the general linear regression model in (2.3), considers the case where the number of predictors p is greater than the sample size n. The values of the explanatory variables are generated from a normal distribution N(0, 1) (Friedman et al. 2010). Three bad leverage points in both the X and Y variables are created by replacing three observations (1, 2, and 3) with an arbitrarily large value equal to 50. The proposed fixed parameters method with the RBF kernel function is applied to detect outliers.

Figure 3.8 (a) and Table 3.8 present the outlier detection results of the proposed method using the predicted values. The results show the superiority of the proposed method over the PCA in detecting all of the leverage points. Figure 3.8 (b) indicates that the PCA detects only one leverage point and fails to identify the other two. The RMD is not included in the comparison for this example, since it handles full rank data only. This means that the RMD method is unsuitable for rank deficient data as well as for high dimensional data, since its cut-off point depends mainly on the number of predictors p.

Figure 3.8 (a): Detection of outliers based on the proposed method for rank deficient data

Figure 3.8 (b): Detection of outliers based on the PCA for rank deficient data

Table 3.8: The results of applying the proposed method for rank deficient data

Index Pred (cut-off 9.97) Index Pred (cut-off 9.97) Index Pred (cut-off 9.97)

1 20.366 10 5.0239 19 1.9822

2 17.949 11 4.3579 20 0.3053

3 16.120 12 3.6194 21 6.0516

4 6.9460 13 2.0315 22 0.0742

5 4.0734 14 1.5434 23 0.4442

6 5.0856 15 7.6742 24 1.7260

7 3.2982 16 6.8492 25 2.3779

8 3.4331 17 2.1920

9 9.9335 18 1.0154

3.4.3 Simulation Data

In this section, we report a simulation study that assesses the reliability of the proposed fixed parameters SVR method and compares it with the RMD technique in terms of correct identification, masking and swamping. The evaluation is based on the rate of correct detection of bad observations and on the rates of masking and swamping. A good approach is one that has a higher percentage of correct detection of bad leverage points together with smaller rates of masking and swamping. The experiments are designed for two sets of explanatory variables, based on linear and nonlinear models. The first set is based on the nonlinear model in (3.10) with two predictors, while the second set is based on the general linear regression model in (2.3) with three predictors (Alguraibawi et al. 2015; Cherkassky and Ma 2004).

The explanatory variables are generated randomly from a uniform distribution with mean zero and variance one, while the additive residuals are generated from the standard normal distribution. In each experiment, different sample sizes (n = 20, 40, 100 and 150) and different percentages of contamination (0.05, 0.10, 0.15 and 0.20) are used. The bad leverage observations are placed at the positions of the first observations in both the X and Y variables. To generate these points, the first observations of each explanatory variable are held fixed at a common value, and the contamination then carries over automatically to the dependent variable through the model in use. The comparison results, based on 1000 replications of this simulation study, are summarized in Tables 3.9 and 3.10. These tables report the percentage of correct detection of bad leverage points and the rates of masking and swamping for all combinations of p, n and the contamination percentage.

Tables 3.9 and 3.10 show that the proposed FP-SVR method consistently attains a high rate of detection of bad leverage points (BLP), with almost negligible swamping and masking rates, for all combinations of p, n and the contamination percentage. The RMD also attains a high rate of detection of BLP, but its swamping rate is much higher than that of the proposed method, which indicates the superiority of the proposed method. Overall, the results of the study show that the proposed FP-SVR technique performs much better than the RMD method.
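The three rates reported in Tables 3.9 and 3.10 can be computed for a single replication as in the following Python sketch. The definitions used here are our assumption: correct detection and masking are taken relative to the number of true bad leverage points (so that they sum to 100%, as they do in the tables), and swamping is taken relative to the number of clean observations. The function name `detection_rates` is hypothetical.

```python
def detection_rates(true_outliers, flagged, n):
    """Return (% correct detection, % masking, % swamping) for one replication.

    true_outliers, flagged: collections of 1-based observation indices.
    n: sample size.
    """
    true_outliers, flagged = set(true_outliers), set(flagged)
    clean = set(range(1, n + 1)) - true_outliers
    correct = 100.0 * len(true_outliers & flagged) / len(true_outliers)
    masking = 100.0 * len(true_outliers - flagged) / len(true_outliers)
    swamping = 100.0 * len(flagged & clean) / len(clean)
    return correct, masking, swamping


# n = 20 with 10% contamination: observations 1 and 2 are the bad leverage points,
# and a hypothetical detector also flags the clean observation 7.
correct, masking, swamping = detection_rates({1, 2}, {1, 2, 7}, n=20)
print(correct, masking, round(swamping, 2))
```

In a simulation study these per-replication values would be averaged over all replications.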

Table 3.9: Percentage of correct identification of BLP, masking and swamping for simulation data with two predictors (p=2)

Contamination  n    % Correct detection   % Masking         % Swamping
                    RMD     FP-SVR        RMD     FP-SVR    RMD     FP-SVR
5%             20   100     100           0       0         30      0.23
5%             40   100     100           0       0         10.6    0.50
5%             100  100     100           0       0         11.7    0.43
5%             150  93.3    93.3          6.7     6.7       1.25    0.29
10%            20   100     100           0       0         ...     0.01
10%            40   100     100           0       0         5.9     0.08
10%            100  100     100           0       0         10.4    0.05
10%            150  100     100           0       0         0.71    0.03
15%            20   100     100           0       0         ...     0.01
15%            40   100     100           0       0         4.6     0.01
15%            100  100     100           0       0         4.4     0
15%            150  97.8    97.8          2.2     2.2       0.89    0
20%            20   100     100           0       0         0       ...
20%            40   100     100           0       0         6.2     0
20%            100  100     100           0       0         1.1     0
20%            150  100     100           0       0         0.7     0

Table 3.10: Percentage of correct identification of BLP, masking and swamping for simulation data with three predictors (p=3)

Contamination  n    % Correct detection   % Masking         % Swamping
                    RMD     FP-SVR        RMD     FP-SVR    RMD     FP-SVR
5%             20   100     100           0       0         30      0
5%             40   100     100           0       0         9.5     0
5%             100  100     100           0       0         5.6     0.01
5%             150  94.3    94.3          5.7     5.7       0.8     0
10%            20   100     100           0       0         5       0
10%            40   100     100           0       0         4.4     0
10%            100  100     100           0       0         2.3     0
10%            150  100     100           0       0         0.7     0
15%            20   100     100           0       0         10      0
15%            40   100     100           0       0         2.8     0
15%            100  100     100           0       0         0.7     0
15%            150  97.7    97.7          2.3     2.3       0.8     0
20%            20   100     100           0       0         0       0
20%            40   100     100           0       0         3       0
20%            100  100     100           0       0         0.1     0
20%            150  100     100           0       0         0.7     0

3.5 Conclusion

We have proposed practical techniques for detecting vertical outliers and bad leverage points in low and high-dimensional data sets (either full rank or rank deficient), using the fixed parameters SVR technique. The effectiveness of these methods was tested on both real and simulated data sets. The applications showed that the proposed approaches have advantages over earlier SVR methods, because they reduce the computational cost and use a fixed set of parameters, which makes them suitable for non-expert users. Further, the proposed methods consistently displayed a high rate of detection of leverage points (almost 100%), with swamping and masking rates no larger than those of the RMD.

CHAPTER 4

A HIGH BREAKDOWN, HIGH EFFICIENCY AND BOUNDED INFLUENCE MODIFIED GM ESTIMATOR BASED ON SUPPORT VECTOR REGRESSION

4.1 Introduction

In regression analysis, we fit the dependent variable to the independent variables in order to obtain the optimal model. This model can then be used to predict the phenomenon under study, which supports correct decisions. A widely used method for estimating the relationship between the variables is the ordinary least squares (OLS) method. Unfortunately, this technique is sensitive to the presence of outliers in the data, which can lead to misleading conclusions. An outlier can be defined as “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” (Hawkins, 1980). There is now wide awareness of the risks posed by outliers in data (Rousseeuw & Leroy, 1987). Hubert et al. (2008) pointed out that estimates obtained with conventional methods differ substantially from the “right” answers (the estimates that would be obtained without the outliers). Outliers occur for various reasons: misplaced decimal points, recording or transmission errors (measurement errors), exceptional phenomena such as strikes or earthquakes, or observations from a different population that have slipped into the sample (Rousseeuw & Leroy, 1987). In regression, outliers are classified into two categories: outliers in the Y-direction and outliers in the X-direction (which are often called leverage points). The outliers in the X-direction are further classified into two groups, bad and good leverage points, according to their impact on the regression model. High leverage does not necessarily mean that the point influences the regression coefficients: good leverage points still follow the pattern of the rest of the data.

Typically, two kinds of tools are used to handle outliers in the data (in either the X- or the Y-direction): regression diagnostics (Cook, 1977) and robust regression methods (Huber, 1973). Both aim at estimates that are not distorted by outliers, but they proceed in different ways. The regression diagnostics procedure cleans the data by applying robust rules for the detection of outliers (fitting the data and then investigating potential outliers) and then applies classical estimation methods to the remainder of the data set. In effect, regression diagnostics assign the detected outliers a weight of zero. Robust regression, on the other hand, is concerned with developing estimators that are not strongly affected by outliers and are not sensitive to the noise distribution (Calafiore, 2000). This procedure can be carried out in two stages: in the first stage, a model is built that is appropriate for the bulk of the data, and in the second, the observations classified as outliers are given low weights.

The robust regression procedure has been found to be a suitable alternative to OLS (Hampel, 1974). Its basic aim is to provide stable estimates in the presence of outliers. The technique finds robust estimates by repeating specific steps until the estimates converge. Moreover, most robust methods use the outputs of the first technique (diagnostics) as their initial inputs. The effectiveness of these methods depends strongly on the breakdown point (BP). Every regression method has a breakdown point, which is defined as the smallest fraction of outlier contamination that can force the value of the estimate outside an arbitrary range (Calafiore, 2000). Yohai and Zamar (1988) stated that one of the goals of robust regression is to achieve, simultaneously, a high breakdown point of nearly 50%, a bounded influence function and high efficiency. The OLS estimators have high efficiency, but their breakdown point is 1/n, which tends to zero as n grows, so a single outlier is sufficient to carry the estimate over all predefined bounds (Maronna et al., 2006). This reflects the extreme sensitivity of the OLS method to outliers, which arises because the method was built under the assumption of normally distributed errors.
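The low breakdown point of OLS can be seen in a small numerical sketch. The Python code below fits a simple regression line to ten points lying exactly on y = 1 + 2x and then replaces a single response by a gross error. The data and the helper name `ols_line` are illustrative assumptions of ours.

```python
def ols_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


x = [float(i) for i in range(1, 11)]
y_clean = [1.0 + 2.0 * xi for xi in x]

y_bad = list(y_clean)
y_bad[-1] = -1000.0  # a single gross error in the response

_, slope_clean = ols_line(x, y_clean)
_, slope_bad = ols_line(x, y_bad)
print(slope_clean, slope_bad)
```

With one contaminated observation out of ten, the fitted slope changes sign, and making the error larger moves the estimate arbitrarily far.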

Different robust regression methods offer different degrees of protection against the various classes of outliers. One of the oldest robust methods is the least absolute values (LAV) or L1-norm method, which minimizes the sum of the absolute residuals. Although LAV achieves high efficiency in comparison with OLS, it still has a low breakdown point (1/n), which makes it less attractive than the other robust regression methods (Armstrong & Kung, 1978). The reason for this limitation is that it does not handle outliers in the X-direction (Bagheri et al., 2010). Huber (1973) suggested another class of robust methods, namely the M-estimators. Although these estimators achieve high efficiency, their breakdown points are still very low, being equal to 1/n (Simpson, 1995). They are successful at overcoming outliers in the Y-direction, but they are still affected by outliers in the X-direction (Rousseeuw and Leroy, 1987). Because of the vulnerability of the M-estimator to leverage points, generalized M-estimators (GM-estimators, also known as GM1) were introduced by Schweppe, as described by Hill (1977). Their basic purpose is to bound the influence of outliers in the X-direction by means of a weight function (Rousseeuw and Leroy, 1987). According to Simpson (1995), this approach has high efficiency and bounded influence. Unfortunately, its breakdown point is no higher than 1/p, so it is robust only when the data contain a small fraction of outliers; the breakdown point is inversely proportional to the number of independent variables.

Rousseeuw (1984) proposed the least median of squares (LMS) approach, which has a high breakdown point, to overcome the limitations of the previous methods. This technique minimizes the median of the squared residuals. Unfortunately, it suffers from an abnormally slow convergence rate, and it has a low breakdown point when the sample contains clustered outliers (Hekimoğlu & Erenoglu, 2013). The least trimmed squares (LTS) method was then suggested; it achieves a breakdown point of 50%, which is the highest possible value (Rousseeuw, 1985). However, a high breakdown point does not on its own constitute the optimal solution: one must also consider the influence function and the efficiency of the robust method. Both the LTS and the LMS approaches have unbounded influence functions and low relative efficiencies, of about 8% and 37% respectively (Rousseeuw, 1984; Rousseeuw, 1985; Rousseeuw & Croux, 1993; Stromberg, Hössjer, & Hawkins, 2000). The S-estimator of Rousseeuw and Yohai (1984) and the MM-estimators of Yohai (1987) satisfy the conditions of a high breakdown point and high efficiency, but do not address bounded influence (Hekimoğlu & Erenoglu, 2013).

Coakley and Hettmansperger (1993) proposed a new class of GM-estimator (GM6), with a high breakdown point roughly equal to 50%, bounded influence and high efficiency for the normal model, to overcome the limitation of the GM1-estimator, which uses OLS as its initial estimator. Regrettably, GM6 treats good leverage points as bad leverage points, which means that its efficiency tends to decrease in the presence of “good” leverage points. Gervini and Yohai (2002) proposed robust and fully efficient regression estimators to handle the presence of outliers. She and Owen (2011) used nonconvex penalized regression to identify outliers and estimate the regression coefficients at the same time. A robust mixture modelling approach (the mean-shift mixture formulation with non-convex sparsity-inducing penalization) has also been proposed to detect outliers and estimate parameters simultaneously (Yu et al. 2015).
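To make the difference between these criteria concrete, the following Python sketch evaluates the OLS, LAV, LMS and LTS objective functions for a fixed vector of residuals, with and without one gross outlier. It does not perform any minimisation. The residual values, the function names, and the choice of h (the number of squared residuals kept by LTS) are illustrative assumptions of ours.

```python
import statistics


def ols_objective(res):
    """Sum of squared residuals."""
    return sum(r * r for r in res)


def lav_objective(res):
    """Sum of absolute residuals."""
    return sum(abs(r) for r in res)


def lms_objective(res):
    """Median of the squared residuals."""
    return statistics.median([r * r for r in res])


def lts_objective(res, h):
    """Sum of the h smallest squared residuals."""
    return sum(sorted(r * r for r in res)[:h])


res_clean = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2]
res_outlier = res_clean[:-1] + [50.0]  # last residual replaced by a gross error
h = len(res_clean) // 2 + 1            # assumed coverage for LTS

for name, res in (("clean", res_clean), ("one outlier", res_outlier)):
    print(name, ols_objective(res), lav_objective(res),
          lms_objective(res), lts_objective(res, h))
```

The OLS and LAV criteria grow with the size of the outlying residual, whereas the LMS and LTS criteria stay bounded, because the largest squared residuals do not enter them.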

In this chapter, we propose a modified GM6 based on a fixed parameters support vector regression technique, and we call this GM-SVR. We show in this study that the proposed estimator achieves a high breakdown point, bounded influence and high efficiency, taking into consideration the presence of “good” leverage points.

This chapter is organized as follows: in Section 4.2 we give a brief description of the modified method of Coakley and Hettmansperger (GM6). In Section 4.3, we apply the modified method to real and artificial data sets. The proposed method is examined on linear regression models by using the Monte Carlo methods, in Section 4.4. Finally, concluding remarks are given in Section 4.5.

4.2 Proposed GM-estimator Based on Fixed Parameter SVR

In this section, the algorithm of the proposed new version of GM-estimator is explained in addition to the initial weights of the compared methods, GM1 and GM6.

4.2.1 Choice of the Initial Weights of GM1 and GM6

In order to illustrate the initial weights of the existing methods GM1 and GM6, we consider the general linear regression model in (2.3). In general, the GM-estimator is the solution of the following normal equation

$$\sum_{i=1}^{n} w_i \, \psi\!\left(\frac{y_i - x_i^{T}\hat{\beta}}{\hat{\sigma}\, w_i}\right) x_i = 0$$

where $w_i$ is the $i$-th initial weight, $\psi$ is the influence function, $y_i$ is the $i$-th element of the response variable, $x_i$ is the $i$-th row of the explanatory variable matrix $X$, $\hat{\beta}$ is the estimate of the vector of parameters, and $\hat{\sigma}$ is the estimate of the scale parameter.

The first main problem is to identify potential outliers and leverage points in the data. There are various techniques for finding leverage points. One of them, which is used in GM1, is to analyse the so-called hat matrix or hat values. The square roots of the diagonal elements of the hat matrix are used to define the weight function (2.68) of this estimator. According to this criterion, the observations with the largest diagonal elements of the hat matrix are considered to be leverage points. However, leverage points might not show up clearly in the corresponding diagonal elements when there are several of them (Rousseeuw & Leroy, 1987).
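For a regression with an intercept and a single predictor, the diagonal elements of the hat matrix have the closed form h_ii = 1/n + (x_i - x̄)² / Σ_j (x_j - x̄)², which the Python sketch below uses. The data are illustrative, and the 2p/n flagging threshold is a common rule of thumb that we assume here; it is not the weight function (2.68) of the text.

```python
def hat_values(x):
    """Diagonal of the hat matrix for a simple regression with intercept."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]  # last point is far in X
h = hat_values(x)

p = 2                    # number of fitted parameters (intercept and slope)
cutoff = 2 * p / len(x)  # assumed rule-of-thumb threshold
flagged = [i + 1 for i, hi in enumerate(h) if hi > cutoff]
print([round(hi, 3) for hi in h], flagged)
```

Only the tenth observation is flagged here. With several leverage points close together, the individual hat values can stay below such a threshold, which is the masking effect mentioned above.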

The other method, GM6, uses the robust Mahalanobis distance (RMD) based on the MCD or MVE to identify the leverage points in the data set, as given by equation (2.40). Any data point whose RMD exceeds the cut-off value is considered to be a leverage point. The initial weights that are used to reduce the effects of the leverage points are calculated using the formula in equation (2.69). However, this down-weights “good” leverage points as well as “bad” ones, which reduces the efficiency of GM6.

4.2.2 Choice of the Initial Weight of the Proposed Method

The main idea of our modification is first to find the outliers and “bad” leverage points, which lie far away from the majority of the data, and then to reduce their weights before applying the M-steps to all of the data. As mentioned before, the GM6 approach uses the RMD to detect leverage points (both good and bad ones) in the data set. It therefore reduces the weights of the bad leverage points, but it also reduces the weights of the good leverage points, which decreases its efficiency whenever good leverage points are present. To avoid this, fixed parameters support vector regression (FP-SVR), as proposed by Dhhan et al. (2015), is used to detect outliers as well as bad leverage points. This is a non-parametric approach that follows statistical learning theory and the principle of structural risk minimization (SRM). These authors introduced a technique that is insensitive to outliers by controlling the free parameters ($C$, $\varepsilon$ and the kernel parameter $h$). The fixed parameters SVR ($C = 100000$, $\varepsilon = 0$ and $h = 1$) can be expressed by the following SVR model

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*})\, K(x_i, x) + b$$

where $\alpha_i$ and $\alpha_i^{*}$ are the Lagrange multipliers, $K(x_i, x)$ is the kernel function and $b$ is the bias term of the SVR model.
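In the usual dual form, an SVR prediction is the bias term plus a kernel-weighted sum over the training points, with weights given by the differences of the two Lagrange multipliers. The Python sketch below evaluates this expression for given multipliers; it does not solve the SVR optimisation problem. The RBF parametrisation exp(-||u - v||² / (2h²)), the toy multipliers, and the function names are assumptions of ours.

```python
import math


def rbf_kernel(u, v, h=1.0):
    """RBF kernel with width h (assumed parametrisation)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * h ** 2))


def svr_predict(x_new, train_x, alpha, alpha_star, bias, h=1.0):
    """Evaluate bias + sum_i (alpha_i - alpha_star_i) * K(x_i, x_new)."""
    return bias + sum(
        (a - a_s) * rbf_kernel(x_i, x_new, h)
        for x_i, a, a_s in zip(train_x, alpha, alpha_star)
    )


# Toy values, chosen only to show the arithmetic.
train_x = [[0.0], [1.0]]
alpha = [1.0, 0.0]
alpha_star = [0.0, 0.5]
pred = svr_predict([0.0], train_x, alpha, alpha_star, bias=0.2)
print(pred)
```

With epsilon set to zero, essentially every training point carries a non-zero multiplier difference, so the sum runs over all observations rather than over a sparse set of support vectors.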

According to Rojo-Álvarez et al. (2003), controlling the values of the SVR parameters makes it possible to reduce the impact of outliers on the solution. Further, the largest Lagrange multipliers ($\alpha_i$ and $\alpha_i^{*}$) belong to the abnormal points in the data set, which are deemed outliers. Since the values of the Lagrange multipliers depend on the value of the parameter $C$, any change in $C$ necessarily affects the values of the Lagrange multipliers. The flexibility of the non-parametric model ensures a significant change in the estimated value for an anomalous point, with only a slight change in the estimated values for the rest of the data. Setting the parameter $\varepsilon$ to zero means that all observations are used, which makes it possible to diagnose outliers in the first iteration. In contrast, the previous SVR methods for outlier detection require several iterations, because they use only part of the observations in the analysis. The value of the kernel parameter $h$ is chosen so that the estimates change linearly whenever the value of $C$ is changed. Thus, the authors point out that a very high value of $C$ produces a high variance between the $\alpha_i$ and $\alpha_i^{*}$ values, resulting in a large value of z for outliers, which can be identified using the cut-off point criterion. Any data point with a value greater than the cut-off point (4.3) is considered to be an abnormal point (outlier or bad leverage point).

In this way, the proposed technique successfully prevents any disturbing effects of outliers and leverage points (bad only) on the estimated parameters.

4.2.3 Algorithm of the Proposed Estimator GM-SVR

In this part, we show the steps of the algorithm for our proposed approach. We make a slight modification to the algorithm of the GM6 method: the initial weights are based on the FP-SVR instead of the RMD, since the FP-SVR is more effective at detecting outliers and bad leverage points (Dhhan et al., 2015). The algorithm for the proposed method can be summarized in the following steps:

Step 1: Use the LTS method as an initial estimator to achieve a high breakdown point of 50% with a convergence rate of $n^{-1/2}$, and calculate the residuals ($r_i$).

Step 2: Calculate the estimated scale of the residuals (from step 1), by applying the next formula

.

Step 3: Using the estimated residuals ($r_i$) and the estimated scale ($\hat{\sigma}$), find the standardized residuals ($e_i$), where $e_i = r_i / \hat{\sigma}$.

Step 4: Compute the initial weight based on FP-SVR (6), where .

Step 5: Use the initial weight (step 4) and the standardized residuals (step 3) to achieve a bounded influence function for bad leverage points,

.

Step 6: Use the weighted residuals () in first iteration WLS to estimate the parameters of the regression based on, where the weight is small for large residuals to get good efficiency (Tukey weight function is used in this chapter).

Step 7: Calculate the new residuals () from WLS and repeat steps (2-6) until the parameters converge.
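The iterative part of the steps above (scale, standardized residuals, weighting and WLS) can be sketched in Python for a single predictor as follows. This is a generic Schweppe-type reweighting loop under our own assumptions, not the exact GM-SVR procedure: the starting fit is a WLS fit with the initial weights rather than LTS, the scale is the normalised MAD rather than the formula of Step 2, the initial weights are supplied by hand rather than computed from FP-SVR, and the Tukey tuning constant 4.685 is the conventional choice.

```python
import statistics


def wls_line(x, y, w):
    """Weighted least squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def tukey_weight(u, c=4.685):
    """Tukey bisquare weight; zero beyond the tuning constant c."""
    return (1.0 - (u / c) ** 2) ** 2 if abs(u) < c else 0.0


def mad_scale(res):
    """Normalised MAD, a stand-in for the scale formula of Step 2."""
    med = statistics.median(res)
    return 1.4826 * statistics.median([abs(r - med) for r in res])


def gm_reweighting(x, y, init_w, max_iter=50, tol=1e-8):
    a, b = wls_line(x, y, init_w)  # stand-in start; the source starts from LTS
    for _ in range(max_iter):
        res = [yi - a - b * xi for xi, yi in zip(x, y)]
        scale = mad_scale(res)
        if scale == 0.0:
            break
        # Residuals are standardised by scale * initial weight, so points with
        # a small initial weight are pushed beyond the Tukey cut-off.
        w = [tukey_weight(r / (scale * p)) if p > 0 else 0.0
             for r, p in zip(res, init_w)]
        new_a, new_b = wls_line(x, y, w)
        done = abs(new_a - a) + abs(new_b - b) < tol
        a, b = new_a, new_b
        if done:
            break
    return a, b


noise = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2, 0.1]
x = [float(i) for i in range(1, 10)] + [30.0]
y = [1.0 + 2.0 * xi + e for xi, e in zip(x, noise)] + [5.0]  # last: bad leverage
init_w = [1.0] * 9 + [0.05]  # hypothetical initial weights

intercept, slope = gm_reweighting(x, y, init_w)
print(round(intercept, 3), round(slope, 3))
```

In this toy example the bad leverage point receives a zero weight after the first reweighting, and the fit settles close to the line y = 1 + 2x that generated the clean points.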

It should be mentioned that the procedure is affine equivariant, which means that under an affine transformation of the design points the estimator is transformed in the same way, without loss of generality. For instance, we have the transformation, our model (1) becomes.

4.3 Artificial and Real Case Studies

In this section, we apply the proposed method (GM-SVR) to the Hawkins-Bradu-Kass and aircraft data (Rousseeuw and Leroy, 1987), and compare the results with those of the traditional methods (OLS, M, MM, GM1, and GM6). Two criteria are used to evaluate the proposed approach: the standard error (SE) of the estimated parameters (Bagheri et al., 2010) and the median absolute deviation (MAD) of the residuals, both based on re-sampling the residuals with the bootstrap technique. The robust methods are evaluated on 1000 bootstrap samples. In addition, a low resampling probability is given to the outliers in order to keep the percentage of outliers fixed.
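The two criteria can be sketched in Python with a plain residual bootstrap. This is a simplified illustration under our own assumptions: OLS with one predictor stands in for the regression methods being compared, residuals are resampled with equal probability (the down-weighting of outliers described above is not implemented), and the data and function names are hypothetical.

```python
import random
import statistics


def ols_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def mad(values):
    """Median absolute deviation about the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def residual_bootstrap(x, y, n_boot=1000, seed=0):
    """Return (bootstrap SE of the slope, mean MAD of the bootstrap residuals)."""
    rng = random.Random(seed)
    a, b = ols_line(x, y)
    fitted = [a + b * xi for xi in x]
    res = [yi - fi for yi, fi in zip(y, fitted)]
    slopes, mads = [], []
    for _ in range(n_boot):
        # New responses: fitted values plus residuals drawn with replacement.
        y_star = [fi + rng.choice(res) for fi in fitted]
        a_s, b_s = ols_line(x, y_star)
        slopes.append(b_s)
        mads.append(mad([ys - a_s - b_s * xi for xi, ys in zip(x, y_star)]))
    return statistics.stdev(slopes), statistics.mean(mads)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]
se_slope, mean_mad = residual_bootstrap(x, y)
print(se_slope, mean_mad)
```

A smaller bootstrap SE and a smaller MAD both indicate a more stable fit, which is how the two criteria are read in the tables of this section.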

4.3.1 Hawkins-Bradu-Kass Data

In order to illustrate the merits of the proposed technique, we use the Hawkins-Bradu-Kass data. Such artificial data offer the advantage that the locations of the bad points, as well as of the good points, are known exactly, which avoids some of the disagreements that are inherent in the analysis of real data sets. The effectiveness of the new technique can therefore be measured by the way it deals with these points. The data set has four dimensions (one dependent and three explanatory variables) and 75 observations. It is constructed with ten bad leverage points (the first 10 samples), which lie far away from the regression plane, and four good leverage points (samples 11–14), which lie close to the regression plane. For illustration purposes, Figure 4.1 shows the two approaches (RMD and FP-SVR) that are used to find the initial weights in GM6 and GM-SVR respectively. This data set is presented in Appendix A3.

From Figure 4.1, we can see that the RMD method gives small weights to all of the leverage points (good and bad), since it detects observations 1–14 as abnormal data, whereas the FP-SVR approach detects only observations 1–10 as abnormal data, which means that it gives a normal weight (equal to one) to the good leverage points (samples 11–14).

Figure 4.1: Detection of leverage points based on RMD and FP-SVR for HBK data (two panels, RMD and FP-SVR, each plotted against the observation index)

Table 4.1: The summary results based on different regression methods for HBK data

Methods  Estimations  Intercept  X1      X2       X3   MAD
OLS      Parameters   -0.3875    0.2391  -0.3345  ...  ...
         Boot.SE      0.4087     0.2580  0.1484   ...
M        Parameters   -0.7798    0.1663  0.0118   ...  ...
         Boot.SE      0.1601     0.0968  0.0638   ...
MM       Parameters   -0.1896    0.0852  0.0410   ...  ...
         Boot.SE      0.1012     0.0631  0.0368   ...
GM1      Parameters   -0.9382    0.1440  0.1942   ...  ...
         Boot.SE      0.1242     0.0745  0.0467   ...
GM6      Parameters   -0.0094    0.0655  0.0124   ...  ...
         Boot.SE      0.1812     0.0638  0.0631   ...
GM-SVR   Parameters   -0.1887    0.0848  0.0409   ...  ...
         Boot.SE      0.0892     0.0579  0.0341   ...

The results of the comparison of the different robust methods are recorded in Table 4.1. This comparison clearly shows the superiority of the proposed method (GM-SVR) over the other robust methods, in terms of the standard deviations of the estimated parameters and MAD. This leads us to conclude that, for this data set, the use of GM-SVR is recommended.

4.3.2 Aircraft Data

This real data example uses 23 observations of single-engine aircraft built over the years 1947–1979. The response (dependent) variable is the cost in units of $100,000, and the independent variables are the aspect ratio, the lift-to-drag ratio, the weight of the plane, and the maximal thrust. This data set contains two influential observations, 16 and 22 (Rahmatullah Imon 2005). The data set is given in Appendix A7.

Table 4.2 displays the results of the robust methods, together with those of the OLS method, for the aircraft data. Judged by the standard errors of the parameters, the proposed method is more effective than the other methods: it produces smaller standard errors for most of the parameters. In addition, the proposed method has the smallest MAD of all the robust and non-robust methods. According to these results, the use of GM-SVR is suggested for this data set.

Table 4.2: The summary results based on different regression methods for aircraft data

Methods  Estimations  Intercept  X1       X2      X3       X4       MAD
OLS      Parameters   -3.7913    -3.8529  2.4882  0.0034   -0.0019  5.9348
         Boot.SE      8.8086     1.5391   0.9978  0.0004   0.0004
M        Parameters   -1.2850    -3.4213  2.2159  0.0029   -0.0015  4.9820
         Boot.SE      7.7345     1.3415   1.0060  0.0004   0.0004
MM       Parameters   6.1417     -3.2305  1.6711  0.0019   -0.0009  3.3744
         Boot.SE      5.7005     1.0221   1.0940  0.0003   0.0003
GM1      Parameters   5.0283     -3.2523  1.8168  0.0020   -0.0010  3.4977
         Boot.SE      5.6153     0.9846   1.0199  0.0003   0.0003
GM6      Parameters   9.9272     -3.3651  2.4270  0.0014   -0.0007  2.9409
         Boot.SE      6.6729     1.0751   2.1337  0.0004   0.0004
GM-SVR   Parameters   10.9837    -3.4207  1.3446  0.0015   -0.0007  0.9583
         Boot.SE      3.6515     0.6326   1.0372  0.00023  0.0002

4.4 Monte Carlo Simulation Studies

In order to investigate the efficiency of the proposed GM-SVR method, we compare it with several robust methods for two different models. These two models are sampled according to the general linear regression model given in equation (2.3). The true value of every parameter is fixed at 1. In this section, four different criteria are used to assess the efficiency of GM-SVR. Since the true parameters are known, we use the Loss criterion (4.4), which is based on the sum of the squared biases of the parameters (Groß, 2003). The Loss criterion can be calculated as follows

The second criterion is the variance of residuals (VAR). The third criterion is the mean square error (MSE) over the testing data, according to equation (4.5). It is found by using the parameters estimated by each robust regression method to predict the output variable at the values of the input variables before contamination.

The last criterion is the efficiency of each robust method as compared to the efficiency of the OLS method which is given by the next equation

As seen in equation (4.6), the efficiency can be computed by dividing the MSE for the OLS with clean data (OLSC) by the MSE for the robust methods. All results are averaged over 1000 replications of random data sets using R software.
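The criteria described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study: the exact forms of equations (4.4) to (4.6) are assumed from their verbal descriptions, with the bias taken from the replicate-averaged estimates and the efficiency expressed in percent. The function names are hypothetical.

```python
import statistics


def loss(estimates, true_beta):
    """Sum over parameters of the squared bias.

    `estimates` holds one coefficient vector per replication; the bias of
    each parameter is the replicate average minus the true value (assumed).
    """
    p = len(true_beta)
    means = [statistics.fmean(est[j] for est in estimates) for j in range(p)]
    return sum((m - b) ** 2 for m, b in zip(means, true_beta))


def mse(y_true, y_pred):
    """Mean squared prediction error over the test observations."""
    return statistics.fmean((a - b) ** 2 for a, b in zip(y_true, y_pred))


def efficiency(mse_olsc, mse_method):
    """Efficiency in percent: MSE of OLS on clean data over MSE of the method."""
    return 100.0 * mse_olsc / mse_method


if __name__ == "__main__":
    # Reproduces one Eff entry of Table 4.3 from its MSE entries
    # (10% contamination, n = 30, GM-SVR): about 95.36.
    print(round(efficiency(0.9600, 1.0067), 3))
```

Under this reading, an efficiency close to 100 means that the robust method on contaminated data predicts almost as well as OLS does on clean data.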

4.4.1 Three-Dimensional Target Function

To provide a better understanding of the GM-SVR estimation, we first simulate a linear regression model with three-dimensional inputs as follows

The values of, are normally distributed with . The variables ( and) are contaminated with different percentages (10%, 20% and 30%) of outliers and bad leverage points. The contamination is carried out by replacing the corresponding percentage of observations with an arbitrarily large value equal to 50.
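The replacement scheme described above can be sketched as follows. This is an illustrative Python sketch under assumptions: the positions to be contaminated are chosen uniformly at random, and the helper name `contaminate` is hypothetical.

```python
import random


def contaminate(values, fraction, bad_value=50.0, seed=0):
    """Return a copy of `values` in which a given fraction of the entries,
    chosen at random (an assumption), is replaced by `bad_value`."""
    rng = random.Random(seed)
    n_bad = round(fraction * len(values))
    out = list(values)
    for i in rng.sample(range(len(values)), n_bad):
        out[i] = bad_value
    return out


if __name__ == "__main__":
    rng = random.Random(42)
    clean = [rng.gauss(0.0, 1.0) for _ in range(30)]
    for pct in (0.10, 0.20, 0.30):
        dirty = contaminate(clean, pct)
        print(pct, sum(v == 50.0 for v in dirty))
```

Applying the function to the response produces vertical outliers, while applying it to an explanatory variable produces leverage points.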

The comparison results of the GM-SVR and the other robust methods are recorded in Table 4.3, which gives the estimates for various sample sizes and percentages of contamination. Table 4.3 reveals that the GM-SVR, together with GM6, has the smallest Loss among all the compared methods. Further, it has the smallest VAR and MSE for nearly all sample sizes and percentages of contamination. This leads to the claim that the GM-SVR method has the highest efficiency (4.6) among the various robust methods. The results of the comparison are displayed graphically in Figure 4.2. It should be mentioned that the graph is limited to the efficiency of GM6 and GM-SVR because the other methods are not competitive, as shown in Table 4.3. From Figure 4.2 we can clearly observe that the proposed method achieves the highest efficiency. We can also note that the efficiency of the robust methods tends to increase with the sample size.

Table 4.3: The summary results based on different regression methods for the three-dimensional simulation target function

Loss

Cont.   n     OLSC      OLS      M        MM        GM1      GM6       GM-SVR
10%     30    0.0001    22.955   1.1331   0.0144    0.0073   0.0001    0.0001
        50    0.00001   13.254   0.9886   0.0008    0.9998   0.0001    0.0001
        100   0.00001   8.8948   1.0332   0.00002   1.0281   0.0001    0.0014
        150   0.00003   11.855   1.0270   0.00002   1.0271   0.00003   0.0015
        200   0.00001   9.1386   1.0164   0.00003   1.0152   0.0001    0.0021
20%     30    0.0001    61.144   1.0589   0.1496    1.0695   0.0001    0.0001
        50    0.00001   33.080   1.0462   0.0611    1.0418   0.0002    0.0001
        100   0.00001   31.764   1.0304   0.0203    1.0278   0.0001    0.0002
        150   0.00003   31.031   1.0233   0.0020    1.0238   0.0001    0.0001
        200   0.00001   32.104   1.0188   0.0001    1.0180   0.00002   0.0002
30%     30    0.0001    67.074   1.1172   0.5168    1.0835   0.0003    0.0001
        50    0.00001   66.998   1.0259   0.4377    1.3362   0.0001    0.00001
        100   0.00001   76.441   1.0332   0.5522    2.7477   0.0001    0.00001
        150   0.00003   73.857   1.0240   0.2787    2.3312   0.0001    0.00008
        200   0.00001   78.959   1.0108   0.3488    3.2111   0.00001   0.0001

VAR

Cont.   n     OLSC     OLS      M        MM       GM1      GM6      GM-SVR
10%     30    0.8938   16.738   1.6832   0.9979   0.9658   0.9500   0.9315
        50    0.9407   9.8698   1.6941   0.9664   1.6997   0.9649   0.9562
        100   0.9677   2.6999   1.7396   0.9817   1.7350   0.9876   0.9735
        150   0.9840   6.6169   1.8475   0.9804   1.8455   0.9880   0.9857
        200   0.9873   2.5528   1.8607   0.9884   1.8593   0.9845   0.9858
20%     30    0.8938   28.003   1.6568   1.1932   1.6392   0.9829   0.9519
        50    0.9407   3.9928   1.7373   1.1304   1.7311   0.9686   0.9605
        100   0.9677   2.3272   1.7330   1.0812   1.7311   0.9839   0.9822
        150   0.9840   2.3389   1.8481   1.0207   1.8465   0.9894   0.9863
        200   0.9873   4.9001   1.8626   1.0005   1.8615   0.9909   0.9905
30%     30    0.8938   11.173   1.6767   1.4337   1.7145   1.0042   0.9852
        50    0.9407   3.9114   1.7366   1.4440   1.8082   0.9717   0.9643
        100   0.9677   2.7256   1.7372   1.5378   1.8346   0.9912   0.9780
        150   0.9840   3.0327   1.8417   1.4211   1.9756   0.9888   0.9840
        200   0.9873   2.8855   1.8583   1.4891   1.9836   0.9932   0.9909

MSE

Cont.   n     OLSC     OLS      M        MM       GM1      GM6      GM-SVR
10%     30    0.9600   20.972   1.8183   1.0779   1.0409   1.0280   1.0067
        50    0.9807   14.683   1.7706   1.0106   1.7744   1.0094   0.9999
        100   0.9876   9.6896   1.7785   1.0037   1.7733   1.0095   0.9955
        150   0.9974   12.646   1.8748   0.9948   1.8725   1.0025   1.0004
        200   0.9974   9.4333   1.8808   0.9992   1.8791   0.9953   0.9970
20%     30    0.9600   65.255   1.8030   1.2975   1.7839   1.0707   1.0355
        50    0.9807   37.509   1.8226   1.1865   1.8163   1.0171   1.0078
        100   0.9876   33.601   1.7754   1.1069   1.7732   1.0072   1.0054
        150   0.9974   32.621   1.8756   1.0366   1.8736   1.0047   1.0016
        200   0.9974   35.426   1.8830   1.0120   1.8816   1.0023   1.0021
30%     30    0.9600   81.559   1.8380   1.5695   1.9249   1.0971   1.0762
        50    0.9807   76.827   1.8394   1.5280   2.4053   1.0247   1.0155
        100   0.9876   79.383   1.7792   1.5761   3.5713   1.0164   1.0027
        150   0.9974   76.365   1.8719   1.4447   3.2591   1.0054   1.0003
        200   0.9974   79.427   1.8798   1.5067   3.9824   1.0059   1.0034

Eff

Cont.   n     OLSC   OLS      M        MM        GM1      GM6       GM-SVR
10%     30    100    4.5775   52.796   89.059    92.227   93.382    95.361
        50    100    6.6793   55.392   97.040    55.272   97.155    98.080
        100   100    10.193   55.532   98.402    55.694   97.831    99.212
        150   100    7.8872   53.199   100.264   53.264   99.490    99.696
        200   100    10.573   53.030   99.816    53.078   100.210   100.035
20%     30    100    1.4712   53.246   73.991    53.815   89.661    92.705
        50    100    2.6147   53.811   82.658    53.998   96.427    97.314
        100   100    2.9393   55.629   89.223    55.700   98.061    98.233
        150   100    3.0576   53.179   96.218    53.234   99.269    99.583
        200   100    2.8154   52.967   98.551    53.006   99.517    99.531
30%     30    100    1.1771   52.231   61.166    49.872   87.510    89.202
        50    100    1.2766   53.320   64.183    40.775   95.707    96.579
        100   100    1.2441   55.512   62.666    27.655   97.168    98.497
        150   100    1.3061   53.284   69.040    30.604   99.210    99.714
        200   100    1.2557   53.058   66.198    25.045   99.159    99.400

Figure 4.2: The efficiency of the GM6 and GM-SVR methods for the three-dimensional simulation target function

4.4.2 Five-Dimensional Target Function

In this simulated example, five explanatory variables are used to explore the effectiveness of the proposed method in comparison with existing robust methods. The linear regression model with a five-dimensional input is given as follows

It is well known in the literature of robust statistics that the efficiency of GM estimators depends on the distribution of the predictors (Maronna et al.), so the values of are sampled from the double exponential distribution with , while the, follows a standard normal distribution. The variables ( and) are contaminated with various percentages of outliers and bad leverage points. The contamination is carried out by replacing the corresponding percentage of observations with an arbitrarily large value equal to 100.

Different sample sizes and percentages of contamination are used to investigate the superiority of the proposed method. As we see from the results in Table 4.4, GM-SVR achieves the smallest Loss criterion (in most cases), VAR and MSE. In other words, the MSE of the proposed method is closer to the MSE of the OLSC than the MSE of the other robust methods, which means that the proposed method achieves the highest efficiency among all robust methods. Figure 4.3 illustrates only the efficiency of the GM6 and GM-SVR methods because the other robust methods produce low efficiency for the various sample sizes and contamination percentages. It can be noted that the superiority of the proposed method is clear across the different sample sizes and contamination levels.

Table 4.4: The summary results based on different regression methods for the five-dimensional simulation target function

Loss

Cont.   n     OLSC     OLS      M        MM       GM1      GM6      GM-SVR
10%     30    0.0004   12.542   0.9952   0.0532   0.5840   0.0017   0.0005
        50    0.0002   17.954   0.9925   0.0238   0.9917   0.0009   0.0003
        100   0.0001   28.465   0.9947   0.0158   0.9941   0.0003   0.0008
        150   0.0001   24.181   0.9935   0.0037   0.9928   0.0001   0.0022
        200   0.0001   28.079   0.9941   0.0031   0.9935   0.0001   0.0041
20%     30    0.0004   120.00   0.9981   0.4303   0.9992   0.0030   0.0005
        50    0.0002   121.81   0.9983   0.3758   0.9981   0.0005   0.0003
        100   0.0001   122.54   0.9978   0.3368   0.9976   0.0003   0.0002
        150   0.0001   121.88   0.9973   0.3539   0.9972   0.0001   0.0001
        200   0.0001   121.50   0.9974   0.3288   0.9972   0.0001   0.0007
30%     30    0.0004   229.28   0.9997   0.7814   4.7804   0.0022   0.0004
        50    0.0002   260.26   0.9992   0.8528   3.4322   0.0006   0.0005
        100   0.0001   306.88   0.9989   0.9549   8.5931   0.0002   0.0001
        150   0.0001   290.44   0.9986   0.9561   5.0762   0.0002   0.0001
        200   0.0001   305.51   0.9985   0.9841   7.9930   0.0001   0.0002

VAR

Cont.   n     OLSC     OLS      M        MM       GM1      GM6      GM-SVR
10%     30    3.3114   55.667   5.4487   3.8050   4.7586   4.6263   3.4706
        50    3.5951   39.348   5.5883   3.9119   5.5416   3.9319   3.6897
        100   3.7877   25.736   5.7997   4.0061   5.7823   3.8675   3.8436
        150   3.8512   18.261   5.8494   3.9881   5.8365   3.9336   3.9326
        200   3.9219   15.981   5.8899   4.0156   5.8807   3.9397   3.9384
20%     30    3.3114   140.62   5.4856   4.6783   6.1742   4.7109   3.6297
        50    3.5951   82.077   5.6293   4.8204   5.5959   3.9688   3.7407
        100   3.7877   43.734   5.8179   4.8501   5.8057   3.8882   3.8567
        150   3.8512   29.935   5.8664   5.0025   5.8573   3.9461   3.9339
        200   3.9219   23.445   5.9026   5.0002   5.8958   3.9494   3.9360
30%     30    3.3114   175.56   5.9017   5.1499   27.853   4.6518   3.7482
        50    3.5951   107.36   5.6585   5.4999   10.703   4.0408   3.8167
        100   3.7877   59.149   5.8363   5.6957   9.3085   3.9256   3.8978
        150   3.8512   38.995   5.8763   5.8228   7.1264   3.9656   3.9544
        200   3.9219   31.75    5.9129   5.8700   7.4739   3.9638   3.9532

MSE

Cont.   n     OLSC     OLS      M        MM       GM1      GM6      GM-SVR
10%     30    3.8412   78.535   6.3842   4.4464   5.5380   5.4859   4.0514
        50    3.9146   61.627   6.1111   4.2753   6.0476   4.3128   4.0335
        100   3.9472   55.269   6.0568   4.1819   6.0337   4.0404   4.0131
        150   3.9575   42.831   6.0185   4.1020   6.0021   4.0477   4.0470
        200   4.0023   43.963   6.0168   4.1013   6.0053   4.0251   4.0244
20%     30    3.8412   307.12   6.4705   5.5162   7.4384   5.6092   4.2717
        50    3.9146   223.60   6.1741   5.2884   6.1286   4.3674   4.1015
        100   3.9472   172.89   6.0832   5.0711   6.0668   4.0678   4.0313
        150   3.9575   155.73   6.0422   5.1515   6.0300   4.0639   4.0510
        200   4.0023   147.48   6.0335   5.1105   6.0251   4.0372   4.0229
30%     30    3.8412   480.32   6.9906   6.0919   44.330   5.5417   4.4399
        50    3.9146   405.23   6.2240   6.0507   16.771   4.4557   4.1975
        100   3.9472   383.15   6.1128   5.9657   18.473   4.1139   4.0823
        150   3.9575   339.23   6.0581   6.0015   11.746   4.0887   4.0761
        200   4.0023   344.70   6.0486   6.0046   14.944   4.0558   4.0440

Eff

Cont.   n     OLSC   OLS      M        MM       GM1      GM6      GM-SVR
10%     30    100    4.8910   60.167   86.389   69.361   70.020   94.811
        50    100    6.3521   64.058   91.564   64.731   90.768   97.054
        100   100    7.1418   65.170   94.386   65.419   97.693   98.359
        150   100    9.2398   65.755   96.477   65.935   97.772   97.788
        200   100    9.1038   66.519   97.586   66.646   99.435   99.451
20%     30    100    1.2507   59.365   69.635   51.640   68.480   89.923
        50    100    1.7507   63.404   74.022   63.875   89.632   95.443
        100   100    2.2830   64.887   77.837   65.063   97.035   97.914
        150   100    2.5411   65.498   76.822   65.629   97.381   97.692
        200   100    2.7137   66.334   78.315   66.427   99.135   99.489
30%     30    100    0.7997   54.948   63.054   8.6650   69.315   86.515
        50    100    0.9660   62.895   64.698   23.341   87.857   93.260
        100   100    1.0301   64.573   66.164   21.367   95.948   96.690
        150   100    1.1665   65.325   65.941   33.692   96.791   97.090
        200   100    1.1611   66.169   66.654   26.781   98.680   98.970

Figure 4.3: The efficiency of the GM6 and GM-SVR methods for the five-dimensional simulation target function

4.5 Conclusion

The ordinary least squares method is the most common technique in regression analysis, because its computational simplicity and excellent statistical properties make it a powerful technique. Unfortunately, these statistical properties depend on some assumptions that are often violated when there are outliers in the data set. Further, a single outlier is capable of breaking the normal distribution assumption for the errors. For this reason, a robust regression procedure is needed, that is, a technique that is more resistant to outliers than the least squares method. The M-estimator is robust against outliers in the y-direction, while the GM-estimator is robust against outliers in both the x-direction and the y-direction. In this study, the most common robust methods, namely M, MM, GM1, and GM6, are studied. In order to increase the efficiency of the GM-estimator, we developed the GM-SVR method, in which the initial weight is based on FP-SVR. To compare these robust regression methods, the standard error of the estimated parameters (SE), the median absolute deviation (MAD), the loss of the estimated parameters (Loss), the variance of residuals (VAR), the mean squared error (MSE) and the efficiency are used. The effectiveness of the proposed approach is tested on real and simulated data sets. The results from both the real and the simulated data show that the GM-SVR method performs well, is robust against both outliers and bad leverage points, and clearly outperforms the other robust methods. Hence, the GM-SVR is suggested as a robust alternative to the GM6 method.

CHAPTER 5

THE SINGLE-INDEX SUPPORT VECTOR REGRESSION MODEL TO

ADDRESS THE PROBLEM OF HIGH DIMENSIONALITY

5.1 Introduction

Over the last few years, applications of the support vector machine (SVM) to classification and regression problems have been increasing, owing to its high performance and its ability to transform non-linear relationships among variables into a linear form by means of a kernel function. It is well known that the SVM is built on statistical learning theory and the structural risk minimization principle, and it is considered a universal technique for solving classification and regression problems (Cortes and Vapnik, 1995). Since its introduction, the SVM has attracted the interest of researchers in the machine learning community because of its excellent performance on a variety of learning problems (Ceperic, 2014). The wide use of the SVM is also related to some additional reasons, such as theoretical guarantees about its performance, lower sensitivity to local minima, and high flexibility to add extra dimensions to the input space without increasing the model complexity (Ceperic, 2014). This flexibility is usually related to the ability of the SVM to produce sparse (less complex) models. However, the SVM uses the dual domain to address the problem of high dimensionality (Suykens et al., 2002), while other models, such as the single index model (SIM), handle this issue in the primal domain. The SIM is a semi-parametric approach that consists of two parts, parametric and non-parametric. The parametric part makes it possible to deal with the explanatory variables according to their importance. Thus, the proposed model aims to achieve dimension reduction and higher accuracy than the fully nonparametric (SVM) estimator in the case of high-dimensional data (Horowitz, 2009).

The support vector machine was initially developed for classification problems (Cortes and Vapnik, 1995), but later in the same year the formulation was extended to cover regression problems (Vapnik, 1995). Further, the SVM appears to offer excellent generalization ability on real-life classification and regression problems while still being capable of producing a sparse model (Ceperic, 2014). The support vector machine for regression (SVR) is described as a fully nonparametric approach used to solve problems that are nonlinear and high dimensional (Williams, 2011). The common formulation of SVR is ε-insensitive support vector regression (Vapnik, 1995). The ε-insensitive loss function presents the best robustness characteristics among various common loss functions, such as the Huber, quadratic, Gaussian, and Laplacian loss functions (Schölkopf and Smola, 2002). With this formulation (ε-SVR), the resulting model depends only on a part of the training data, because training points that lie within the threshold ε do not contribute to the final model. Unfortunately, this points to a potential problem: if the value of the threshold is small (ε near zero), the resulting model depends on a greater number of training data points, which makes the solution more complex (non-sparse) (Guo et al., 2010). It should be noted that the SVR uses the dual domain to overcome the curse of dimensionality, employing the threshold ε to produce a sparse model. On the other hand, it is possible to alleviate this issue in the primal domain by using the single index model, which also avoids a non-sparse model. Further, this technique provides the flexibility to study inferential aspects, such as the importance of variables (the most influential variables), in addition to dimension reduction and high accuracy.

Recently, an important question has been raised in applications of statistical analysis, namely how to reduce the influence of the dimensionality problem (Algamal and Lee, 2015). To achieve this goal, there are two main directions, the parametric and the nonparametric model. The parametric model relies on some assumptions and is not easy to adapt to high-dimensional phenomena. Moreover, the potential approximation error is larger in 2 dimensions than in 1 dimension; it could reach 10 times that of 1 dimension, and this number will double for each additional explanatory variable (Yatchew, 2003). In contrast, the nonparametric model offers flexibility, but it suffers from less precision when the number of covariates increases (Härdle et al., 2004). To cope with the dimensionality problem, Ichimura (1993) studied and developed a single index model, which summarizes the covariates within a single variable called the index. The SIM is considered one of the most popular semi-parametric techniques used to handle the high dimensionality problem. The advantages of this semi-parametric model are appealing to researchers who need to reduce dimensionality in applications, since it provides high accuracy while maintaining the flexibility of a nonparametric model (Horowitz, 2009). In general, the single index model combines the interpretability of the parametric model with the flexibility of the nonparametric approach.

The remainder of this chapter is organized as follows. The structure of the proposed single-index support vector regression model is described in section 5.2. After the partition into training and testing data is described in section 5.3, the effectiveness of the proposed method is examined in section 5.4 using three simulation studies with high-dimensional problems (linear and nonlinear). Further, the performance of the proposed method is evaluated on a real data set in section 5.5. Finally, the chapter is concluded with a discussion in section 5.6.

5.2 Single-Index Support Vector Regression (SI-SVR)

In order to illustrate the proposed technique SI-SVR, we consider the following single index model

$$y_i = g(x_i^{\top}\beta) + \varepsilon_i$$

where $y_i$ is the response variable, $x_i$ is the covariate vector, $\varepsilon_i$ is the error term, $g(\cdot)$ is the unknown nonparametric link function, and $\beta$ is the unknown parameter vector.

Ichimura (1993) proposed two approaches to estimate the coefficient vector: the semi-parametric least squares (SLS) and the weighted semi-parametric least squares (WSLS) methods. In order to satisfy the identification of the single index model, he restricted the coefficient vector to have unit norm and a positive first component. The estimation procedure for the single index model is carried out by first estimating the coefficient vector, at the root-n rate of convergence, and then estimating the unknown nonparametric link function.
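The identification constraint and the role of the index can be illustrated with a small Python sketch. It assumes that the model has the usual single-index form, a response equal to a link function of a linear combination of the covariates plus noise, and that identification uses a unit Euclidean norm with a positive first component. The names `normalize_index` and `single_index`, and the sine link in the usage example, are hypothetical.

```python
import math


def normalize_index(beta):
    """Rescale a coefficient vector to unit Euclidean norm and flip its sign
    so that the first component is positive (assumed identification rule)."""
    norm = math.sqrt(sum(b * b for b in beta))
    if norm == 0.0:
        raise ValueError("the coefficient vector must be non-zero")
    scaled = [b / norm for b in beta]
    if scaled[0] < 0:
        scaled = [-b for b in scaled]
    return scaled


def single_index(x, beta):
    """Collapse a covariate vector into the scalar index x'beta."""
    return sum(xi * bi for xi, bi in zip(x, beta))


if __name__ == "__main__":
    beta = normalize_index([-3.0, 4.0])
    print(beta)                                  # unit length, first entry positive
    u = single_index([1.0, 2.0], beta)           # one-dimensional index
    print(math.sin(u))                           # hypothetical link function g = sin
```

Whatever the number of covariates, the link function only ever sees the scalar index, which is how the model reduces the dimension of the nonparametric part.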

The conditional variance of these methods, SLS and its weighted version WSLS can be written as follows

These methods are used to find a suitable estimator for the coefficient vector that minimizes the next objective function, which represents the right side of the conditional variance

so that the parameters estimates can be computed as follows

where is the trimming factor, and is the unknown nonparametric link function estimator of .

In this chapter, we will use the support vector regression to estimate the link function since it is considered one of the most popular kernel regression methods that helps alleviate the non-linearity and the dimensionality problems. To estimate the single index by support vector regression, we consider the next model

where is the weight vector, is a nonlinear function and is the bias term. The function should be as flat as possible (Vapnik 1995). This feature of flatness can be provided by minimizing the Euclidean norm.

To optimize the generalization ability (predicted risk), the weight vector and the bias term should be estimated by minimizing the next ε-insensitive loss function (Vapnik, 1995)
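As an illustration of the ε-insensitive loss, the short Python sketch below uses its standard textbook definition: deviations of at most ε cost nothing, and larger deviations are penalized linearly. The function names are hypothetical, and the sketch is not meant to reproduce the exact notation of this chapter.

```python
def eps_insensitive_loss(y, prediction, eps):
    """Standard epsilon-insensitive loss: zero inside the eps-zone,
    linear in the excess deviation outside it."""
    return max(0.0, abs(y - prediction) - eps)


def empirical_risk(ys, predictions, eps):
    """Average epsilon-insensitive loss over a sample."""
    losses = [eps_insensitive_loss(y, f, eps) for y, f in zip(ys, predictions)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    ys = [1.0, 2.0, 3.0]
    fitted = [1.1, 2.5, 3.0]
    print(empirical_risk(ys, fitted, eps=0.2))  # only the middle point contributes
    print(empirical_risk(ys, fitted, eps=0.0))  # every non-zero residual contributes
```

With ε equal to zero the loss reduces to the absolute error, so every point with a non-zero residual affects the fit.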

The problem (5.5) could be presented as a convex optimization problem

where C is the penalty parameter, which determines the trade-off between the flatness of the function and the amount up to which deviations larger than the ε-zone are tolerated, and the slack variables measure the deviations of the training vectors outside the ε-zone.
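For reference, the standard textbook form of the ε-insensitive SVR primal problem is written out below. The symbols $w$, $\phi$, $b$, $\xi_i$ and $\xi_i^{*}$ follow common usage and are assumptions here; the chapter's own equations may use different notation or apply the map $\phi$ to the single index rather than to the full covariate vector.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad
  & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right) \\
\text{subject to} \quad
  & y_i - w^{\top}\phi(x_i) - b \le \varepsilon + \xi_i, \\
  & w^{\top}\phi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \\
  & \xi_i \ge 0, \quad \xi_i^{*} \ge 0, \qquad i = 1, \dots, n .
\end{aligned}
```

In this form the first term controls flatness through the Euclidean norm of the weight vector, and the second term, weighted by C, charges only for deviations that leave the ε-zone.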

To solve the optimisation problem in (5.7) easily, the dual formulation is used.

Thus, applying a set of partial derivatives leads to the following SI-SVR model

where the kernel function is used to handle the non-linear relationships among variables in the input space.

5.3 Training and Testing Data

First and foremost, the test partition is created to give researchers an honest assessment of the performance of the predictive models. Results based on the training data alone will not convince an experienced observer, because they are easy to manipulate by controlling the free parameters (Rojo-Álvarez et al., 2003). This is the main reason for dividing the data into two groups, the training and testing sets. We train the regression on the training set, tune the parameters using any tuning method, and then test the performance of the SVR model on the unseen (test) set. It should be noted that only the training set is available while training the regression, and only the test set is available while testing the predictive model.
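A kernel-based prediction of this kind can be sketched in Python as follows. The sketch assumes the usual dual-form SVR predictor, a bias plus a weighted sum of kernel evaluations at the support vectors, and an RBF kernel parameterized as exp(-||u - v||^2 / (2 h^2)). The chapter's own parameterization of the kernel parameter h may differ, and the function names and the numbers in the usage example are hypothetical.

```python
import math


def rbf_kernel(u, v, h):
    """Radial basis function kernel with bandwidth h (assumed parameterization)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * h * h))


def svr_predict(x_new, support_vectors, dual_coefs, bias, h):
    """Dual-form prediction: bias plus a weighted sum of kernel evaluations.
    Each entry of dual_coefs plays the role of a difference of the two
    Lagrange multipliers attached to one support vector."""
    return bias + sum(
        coef * rbf_kernel(sv, x_new, h)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


if __name__ == "__main__":
    # In a single-index setting the inputs would be one-dimensional indices.
    support_vectors = [[-1.0], [0.0], [1.5]]
    dual_coefs = [0.8, -0.3, 1.1]
    for h in (1.0, 5.0):
        print(h, svr_predict([0.5], support_vectors, dual_coefs, bias=0.2, h=h))
```

Only the support vectors enter the sum, which is the sense in which a fitted SVR model can be sparse.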
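The partition described above can be sketched in a few lines of Python. This is an illustration only: the random shuffling, the default 70/30 proportion and the function name `train_test_split` are assumptions, not the procedure used in the study.

```python
import random


def train_test_split(rows, train_fraction=0.70, seed=0):
    """Shuffle the row indices and cut them into disjoint training and test sets."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


if __name__ == "__main__":
    data = [(float(i), 2.0 * i) for i in range(100)]
    train, test = train_test_split(data)
    print(len(train), len(test))
```

Tuning of the free parameters would use only `train`; `test` is touched once, to report the prediction risk.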

5.4 Simulation Studies

In this section, we compare the proposed method, the single index support vector regression model, with the support vector regression method (SVR) in the case of high-dimensional problems. Three examples are used to study the performance of the proposed SI-SVR method and the SVR method. The first two examples combine the problems of non-linearity and high dimensionality, while the third example is a linear target function (Hu et al., 2013; Wu et al., 2010). The methods are evaluated on the basis of the prediction risk criterion (the mean squared error for the test data). The prediction risk (MSE) is averaged over 100 replications of random data sets. Each data set is divided into training samples (0.70) and testing samples (0.30). The kernel function used in this section is the radial basis function (RBF).
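The evaluation protocol described above (simulate a data set, split it 70/30, fit on the training part, compute the test MSE, and average over replications) can be sketched as follows. This is a hedged Python illustration: `prediction_risk`, `simulate_toy` and `mean_fit` are hypothetical names, and the trivial mean predictor merely stands in for SVR or SI-SVR.

```python
import math
import random
import statistics


def prediction_risk(simulate, fit, n_rep=100, train_fraction=0.70, seed=0):
    """Average test-set MSE over n_rep simulated data sets.

    simulate(rng) must return (x, y); fit(x_train, y_train) must return a
    function that maps one input to one prediction.
    """
    rng = random.Random(seed)
    risks = []
    for _ in range(n_rep):
        x, y = simulate(rng)
        idx = list(range(len(y)))
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train, test = idx[:cut], idx[cut:]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        risks.append(statistics.fmean((y[i] - predict(x[i])) ** 2 for i in test))
    return statistics.fmean(risks)


def simulate_toy(rng, n=50):
    """Hypothetical one-dimensional target with Gaussian noise."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [math.sin(v) + rng.gauss(0.0, 0.1) for v in x]
    return x, y


def mean_fit(x_train, y_train):
    """Trivial baseline learner: always predict the training mean."""
    center = statistics.fmean(y_train)
    return lambda x_new: center


if __name__ == "__main__":
    print(prediction_risk(simulate_toy, mean_fit))
```

Because the learner is passed in as a function, two methods can be compared on identical simulated data sets by calling `prediction_risk` with the same seed.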

5.4.1 Four-Dimensional Target Function

Consider the following single-index regression model

where the set of coefficients with , the set of explanatory variables are generated identically and independently from a normal distribution with 0, 0.5, and the variable is sampled from a standard normal distribution. The models are built using five different sample sizes, n = (30, 50, 100, 200, and 250), with 100 replications.

Table 5.1 reports the prediction error (MSE) of the SVR and SI-SVR methods for different settings of the free parameters (C, ε and h), with two values (small and large) for each parameter. All results point to the superiority of the proposed method over SVR: for every combination of the parameters, the proposed method gives a smaller MSE. Figure 5.1 also shows the superiority of the SI-SVR method over the SVR method for all sample sizes; the MSE curve of SVR exhibits some sharp jumps (ups and downs). These jumps become more pronounced for the SVR method as the sample size increases, while they become less pronounced for the proposed method. However, the MSE values of the two methods are close for a small value of the kernel parameter h and move apart for a large value of this parameter. In general, the prediction error decreases as the sample size increases for both the SI-SVR and SVR methods.

Table 5.1: The MSE of SVR and the SI-SVR methods for four-dimensional target function

             SVR                                 SI-SVR
             ε=0               ε=0.2             ε=0               ε=0.2
n     Par    C=1      C=100    C=1      C=100    C=1      C=100    C=1      C=100
30    h=1    7.6003   6.0199   7.6159   5.5787   6.3306   4.5076   6.2903   4.1712
      h=5    9.8700   9.4171   9.8771   9.4396   8.1218   5.9637   8.0910   5.8110
50    h=1    4.2403   4.0531   4.2518   3.6186   3.4044   2.5257   3.4015   2.4509
      h=5    8.9738   6.2776   9.0068   6.3901   4.7968   3.9526   4.7975   3.6590
100   h=1    3.4154   3.7851   3.4327   3.4025   1.9863   1.2603   1.9884   1.2106
      h=5    8.6332   6.1334   8.6612   6.1244   3.2002   2.0673   3.1874   1.9705
200   h=1    2.9144   2.9814   2.9207   2.7498   2.3076   1.5373   2.3019   1.5170
      h=5    7.5402   5.6689   7.5647   5.5497   3.0683   2.6074   3.0579   2.5713
250   h=1    2.3431   2.5419   2.3450   2.3247   1.3881   1.0715   1.3890   1.0686
      h=5    6.1838   4.9297   6.2046   4.7200   1.8872   1.3530   1.8878   1.3230

Figure 5.1: The MSE of SVR and SI-SVR methods for four-dimensional target function

5.4.2 Eight-Dimensional Target Function

In this example, we consider the single-index model with eight explanatory variables as follows

where with , the variables follow a standard normal distribution, and the error term is generated from the exponential distribution with parameter 1/2. The same sample sizes and sets of free parameters are used as in the previous example. The results of applying SVR and SI-SVR are presented in Table 5.2 and Figure 5.2, which again demonstrate the superiority of SI-SVR. According to Figure 5.2, the MSE curve of the SVR method tends to be flat for all sample sizes, whether small, moderate or large, whereas some jumps appear in the MSE curve of the SI-SVR method for moderate sample sizes, especially when n=100. However, the proposed method maintains its advantage over the standard SVR method for all sample sizes.

Table 5.2: The MSE of SVR and SI-SVR methods for eight-dimensional target function

             SVR                                 SI-SVR
             ε=0               ε=0.2             ε=0               ε=0.2
n     Par    C=1      C=100    C=1      C=100    C=1      C=100    C=1      C=100
30    h=1    8.8331   8.9143   8.8551   8.9431   7.4562   7.6420   7.4555   7.5847
      h=5    8.8549   8.9738   8.8770   8.9991   7.3049   7.0608   7.3440   6.6986
50    h=1    9.6498   9.4324   9.6456   9.4531   7.5855   8.1744   7.5824   7.8110
      h=5    9.7213   9.7382   9.7159   9.7467   7.6007   7.7830   7.5789   7.3709
100   h=1    11.293   10.994   11.287   11.014   9.6509   10.069   9.5987   9.9314
      h=5    11.454   11.415   11.442   11.420   9.5728   10.707   9.5468   10.350
200   h=1    10.239   9.7983   10.238   9.8240   8.3910   8.3660   8.3650   8.3313
      h=5    10.523   10.460   10.511   10.463   8.5566   9.0269   8.5443   8.8827
250   h=1    9.4830   9.1204   9.4851   9.1383   5.8306   5.7093   5.8041   5.6400
      h=5    9.7046   9.6765   9.7049   9.6824   6.2870   6.6161   6.2724   6.4472

Figure 5.2: The MSE of SVR and SI-SVR methods for eight-dimensional target function

5.4.3 Fifteen-Dimensional Target Function

The next example explores the effectiveness of the proposed method on a linear model with fifteen explanatory variables

where with , the set of explanatory variables and the additive error are distributed from normal distribution with mean 0 and variance 1. This example differs from the previous two nonlinear examples in that there is a linear relationship between the dependent and independent variables. However, it uses the same sample sizes and the same set of free parameters.

The results of applying SVR and SI-SVR are summarized in Table 5.3 and Figure 5.3. Again, the findings point to the superiority of the proposed SI-SVR method over the standard SVR method. Table 5.3 shows that the MSE of the SVR method is almost constant (around 16) for all sample sizes and for the different values of the free parameters. On the other hand, the proposed method achieves a remarkable decrease in the MSE values as the sample size increases, which distinguishes the SI-SVR from the SVR method. This decrease in the MSE values of the proposed SI-SVR method is clearly shown in Figure 5.3. Figure 5.3 shows some ripples in the MSE curve of SI-SVR, but they do not change its superiority over the SVR method. In general, the proposed method offers good performance in comparison with the standard SVR method as the dimensionality of the explanatory variables increases.

Table 5.3: The MSE of SVR and SI-SVR methods for fifteen-dimensional target function

             SVR                                 SI-SVR
             ε=0               ε=0.2             ε=0               ε=0.2
n     Par    C=1      C=100    C=1      C=100    C=1      C=100    C=1      C=100
30    h=1    16.435   16.086   16.418   16.083   10.329   7.844    10.310   7.711
      h=5    16.436   16.088   16.418   16.085   12.591   9.765    12.563   9.532
50    h=1    16.505   16.222   16.459   16.216   8.416    5.555    8.408    5.496
      h=5    16.506   16.224   16.460   16.218   11.007   7.799    11.016   7.574
100   h=1    16.069   16.026   16.066   16.028   5.878    3.637    5.873    3.592
      h=5    16.071   16.033   16.068   16.035   8.432    5.513    8.428    5.372
200   h=1    15.956   15.900   15.948   15.900   3.966    2.492    3.961    2.470
      h=5    15.960   15.913   15.952   15.913   5.851    3.605    5.841    3.537
250   h=1    16.071   16.025   16.068   16.026   3.626    2.178    3.625    2.163
      h=5    16.075   16.043   16.072   16.042   5.321    3.109    5.315    3.064

Figure 5.3: The MSE of SVR and SI-SVR methods for fifteen-dimensional target function

5.5 Real Case Study

In this section, the Prostate cancer is used as a real data example. The Prostate cancer data consist of eight explanatory variables based on eight clinical measures. The criterion of the prediction risk has been used for comparison between of these methods. The RBF kernel function is used to map the inputs into a high dimensional feature space.

5.5.1 Prostate Cancer Data

In this part, Prostate cancer data have been utilized to evaluate the proposed method (SI-SVR) under condition of high-dimensionality. The data come from a study of prostate cancer that was done by Stamey et al. (1989). The response variable is the logarithm of prostate specific antigen. The eight predictor variables are respectively, log (cancer volume) (), log (prostate weight) (), age (), the logarithm of the amount of benign prostatic hyperplasia (), seminal vesicle invasion (), log(capsular penetration) (), Gleason score () and percentage Gleason score 4 or 5 (). All of these variables are standardized to satisfy. This data set consists of 97 observations and it has been divided to 0.75 as training samples and 0.25 as testing samples. This data set is given in R software, package “lasso2” under name “Prostate”.

Table 5.4 summarizes the mean squared error (MSE) of the two methods (SVR and SI-SVR) for a set of parameters (C, ε and h), with three values for each parameter (small, moderate and large). Figure 5.4 shows the differences among the prediction risks (MSE) of these techniques. The proposed method clearly has a lower prediction risk than the SVR method for all combinations of the free parameters. There are also some sharp jumps (up and down) in the MSE curve, especially for the SVR method; these jumps are due to the differences between the values of the parameters (C, ε and h). On the other hand, the MSE curve of the proposed SI-SVR method is much smoother than that of the standard SVR method, which reflects the superiority of the proposed method for the Prostate cancer data. In general, the proposed SI-SVR method is better than the SVR method at addressing the problem of high dimensionality.

Table 5.4: The MSE of SVR and SI-SVR methods for Prostate cancer data

Parameters      SVR                       SI-SVR
h      ε        C=1     C=50    C=100     C=1     C=50    C=100
0.5    0.0      1.1370  1.1355  1.1355    0.6301  0.3936  0.3827
0.5    0.1      1.1390  1.1173  1.1173    0.6081  0.3980  0.3845
0.5    0.2      1.1378  1.1373  1.1373    0.5749  0.4017  0.3887
1      0.0      1.2540  1.1684  1.1684    0.4673  0.4060  0.4082
1      0.1      1.2511  1.1913  1.1913    0.4987  0.4078  0.4093
1      0.2      1.2436  1.2058  1.2058    0.5469  0.4020  0.3998
5      0.0      1.4889  1.4155  1.4155    0.4671  0.4853  0.5320
5      0.1      1.4789  1.4147  1.4147    0.5066  0.5022  0.5322
5      0.2      1.4759  1.4140  1.4140    0.5616  0.4810  0.5309

Figure 5.4: The MSE of SVR and SI-SVR methods for Prostate cancer data

5.6 Discussion and Conclusion

In this chapter, we have proposed the single-index support vector regression technique for dimension reduction. It is a semiparametric approach that combines the high accuracy of parametric methods with the flexibility of nonparametric methods. Support vector regression has been used to estimate the unknown nonparametric link function.

To compare our proposed method, SI-SVR, with the fully nonparametric SVR method, three simulation examples and a real data set have been used. The same conditions have been provided for the comparison, namely the same combinations of free parameters and sample sizes. The comparison results demonstrate the superiority of the proposed SI-SVR method over the SVR method in reducing the curse of high dimensionality for both linear and nonlinear target functions.

CHAPTER 6

ELASTIC NET FOR SINGLE INDEX SUPPORT VECTOR REGRESSION MODEL

6.1 Introduction

The linear regression model is one of the most popular techniques for studying the relationship between a dependent variable (output) and a vector of explanatory variables (inputs). Unfortunately, the relationships between variables in most real applications are nonlinear, which limits the applicability of the linear regression model; using it to describe nonlinear relationships is not an appropriate choice. To enhance the flexibility of the model, the single index model was proposed with a smooth unknown link function (Ichimura, 1993). It is an extension of the linear regression model that deals with nonlinear relationships among variables. It is more flexible than parametric models and at the same time keeps their good properties. The single index model is a semiparametric model consisting of two parts, parametric and nonparametric, which need to be estimated together. Building the single index model involves two steps: the first is to estimate the parameters, and the second is to estimate the unknown link function. It is well known that the convergence rate of the parametric estimator is much faster than that of the nonparametric estimator (Peng and Huang, 2011). Hence, accurate and efficient estimation of the parameters leads to a good estimate of the link function of the single index model. However, if the set of explanatory variables contains irrelevant (noise) variables or includes hundreds of variables, the precision of the parameter estimates deteriorates because of the curse of high dimensionality. As a result, selecting the most significant variables from the set of inputs is a very important issue for the single index model. A further problem that affects the parameter estimates, in addition to the presence of noise variables, arises when the number of predictors p is greater than the sample size n (less than full rank), although the number of significant variables is typically less than n. In this case, the parameters cannot be estimated without a variable selection step.

Recently, the single index model developed by Ichimura (1993) has been widely studied and has gained much attention due to its excellent performance in dealing with high-dimensional problems in standard mean regression. The single-index model achieves dimension reduction efficiently and avoids the so-called curse of high dimensionality because the index $x^{T}\beta$ aggregates the dimension of the covariates. Hence, the link function $g$ in a single-index model can be estimated with the same convergence rate in probability that it would have if the one-dimensional quantity $x^{T}\beta$ were observable (Horowitz, 2009). Furthermore, the parameters $\beta$ can be estimated with the same convergence rate, $n^{-1/2}$, that is achieved in a parametric linear model. Consequently, in terms of convergence rate in probability, the single-index model is as accurate as a parametric model for estimating $\beta$ and as precise as a one-dimensional nonparametric mean regression for estimating $g$. The ability of single-index models to reduce dimension gives them a considerable advantage over nonparametric techniques in applications where $x$ is multidimensional and the single-index structure is plausible. This advantage comes from the fact that the single index model is a semiparametric model. It is well known that a parametric model rests on strong assumptions and does not adapt easily to the problem of high dimensionality. In contrast, a nonparametric model is flexible, but its precision decreases as the number of covariates increases (Härdle et al., 2004). Combining parametric and nonparametric models results in hybrid assumptions for the single-index models: they are weaker than the assumptions of a parametric model and stronger than those of a fully nonparametric model. In other words, the single-index models provide high accuracy while maintaining the flexibility of a nonparametric model (Horowitz, 2009). Therefore, the single index models reduce the risk of misspecification relative to parametric models and avoid some drawbacks of fully nonparametric models, such as the difficulty of interpretation and the inability to extrapolate. In general, the single index models combine the interpretability of the parametric model with the flexibility of the nonparametric model.

Variable selection is a very important tool in function approximation, and it has become the focus of much research in areas of application for which data sets with tens, hundreds or more variables are available (Guyon and Elisseeff, 2003). In most applications, the generalization ability deteriorates if redundant predictors are included. Variable selection addresses this problem by directly reducing the number of original predictors to a significant subset that retains the generalization ability of the original inputs. Penalization techniques have been proposed as promising ways to improve on the ordinary least squares (OLS) method when one or more of the Gauss-Markov assumptions are violated. Tibshirani (1996) proposed a promising method called the LASSO. It is a penalized least squares technique that imposes an $\ell_1$-penalty on the regression coefficients. Due to the nature of the $\ell_1$-penalty, the lasso performs continuous shrinkage and variable selection simultaneously. Further, the lasso appeals to many researchers owing to its ability to produce sparse representations. However, the lasso has some drawbacks (Zou and Hastie, 2005). First, when p is larger than n, the lasso selects at most n predictors before it saturates, due to the nature of the convex optimization problem. This appears to be a limiting feature for a variable selection method. Second, if there are very high pairwise correlations among a group of variables, the lasso tends to choose a single variable from the group at random, without taking the importance of the variables into consideration. To overcome these problems, Zou and Hastie (2005) proposed the Elastic Net method, which improves the ability of the lasso to select variables when p is greater than n.

The estimation of the nonparametric part of the single index model is as important as the estimation of the parameters, because this part completes the predictive model and should therefore be estimated efficiently. In this chapter, we use the support vector machine to estimate the unknown link function. The support vector machine (SVM) is a set of supervised learning algorithms that can be applied to classification (SVC) and regression (SVR) problems (Vapnik, 1995). It is a fully nonparametric approach and is considered an important methodology in the field of neural networks and nonlinear modelling (Suykens et al., 2002). An interesting feature of the SVR model is that a sparse solution can be obtained in the dual space, in the sense that the coefficients attached to some observations in the solution are equal to zero (Suykens et al., 2002). This advantage allows the SVR model to alleviate the so-called curse of high dimensionality, including the case when p is larger than n, which is the focus of this chapter. Hence, using the SVR model to estimate the unknown link function provides sufficient flexibility to solve the problem in the primal space as well as in the dual space.

In this chapter, we suggest an extension of the SIM of Ichimura (1993) that uses the Elastic Net penalty for estimation and variable selection. Further, support vector regression is proposed for estimating the unknown link function, while nonlinear least squares is used to estimate the vector of parameters (Ichimura, 1993).

The rest of this chapter is organized as follows. The penalized SIM with the Elastic Net is introduced in Section 6.2. In Section 6.3, support vector regression is proposed to estimate the unknown nonparametric link function of the Elastic Net single index model (ENSI). Numerical studies are conducted under various settings in Section 6.4 to evaluate the proposed method. In Section 6.5, a real data analysis is reported to illustrate the proposed method. Finally, the conclusions are summarized in Section 6.6.

6.2 Elastic Net Single Index (ENSI)

The Elastic Net penalty was proposed by Zou and Hastie (2005) for simultaneous variable selection and parameter estimation. It is a regularized regression approach that combines two techniques, the LASSO ($\ell_1$ norm) and Ridge regression ($\ell_2$ norm). The LASSO penalty tends to select a single feature from a group of correlated features and discard the others, whereas the Ridge penalty shrinks their coefficients towards each other. Mixing these two penalties overcomes a limitation of the LASSO, namely its inability to select more predictors than there are observations in the data set. Thus, the use of this technique helps to get rid of redundant variables. In addition, it yields a non-zero determinant of the predictor matrix (full rank), which allows us to apply the single index model. The proposed Elastic Net Single Index method simultaneously achieves variable selection, parameter estimation and mean regression estimation. To illustrate the methodology, we consider the following model

$$y_i = g(x_i^{T}\beta) + e_i, \qquad i = 1, \dots, n,$$

where $y_i$ is the response variable, $x_i$ is the p-dimensional vector of predictors, $\beta$ is the vector of unknown coefficients, $e_i$ is the residual term and $g(\cdot)$ is an arbitrary smooth nonparametric link function. The method aims to find a suitable estimator of the coefficient vector $\beta$ that minimizes the following objective function

$$\sum_{i=1}^{n}\bigl(y_i - \hat{g}(x_i^{T}\beta)\bigr)^2 + \lambda_1 \sum_{j=1}^{p}|\beta_j| + \lambda_2 \sum_{j=1}^{p}\beta_j^2,$$

where $\hat{g}$ is the estimator of the unknown link function $g$, and $\lambda_1$ and $\lambda_2$ correspond to the LASSO and Ridge regression penalties, respectively, which control the amount of regularization applied to the estimation. It should be noted that if $\lambda_2 = 0$, the Elastic Net estimation reduces to the LASSO estimation, while if $\lambda_1 = 0$, it reduces to the Ridge estimation.
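A small numerical sketch may help to see how the two penalty terms behave. It assumes the usual elastic net form, a residual sum of squares plus $\lambda_1$ times the $\ell_1$ norm plus $\lambda_2$ times the squared $\ell_2$ norm of the coefficients; the function names and the example numbers are illustrative only and are not taken from the thesis.

```python
def elastic_net_penalty(beta, lam1, lam2):
    """Return lam1 * ||beta||_1 + lam2 * ||beta||_2^2."""
    l1_norm = sum(abs(b) for b in beta)
    l2_norm_sq = sum(b * b for b in beta)
    return lam1 * l1_norm + lam2 * l2_norm_sq


def penalized_objective(y, fitted, beta, lam1, lam2):
    """Residual sum of squares of the fitted link values plus the penalty.

    `fitted` holds the values g_hat(x_i' beta), however they were obtained.
    """
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return rss + elastic_net_penalty(beta, lam1, lam2)


beta = [0.6, 0.0, -0.8]
lasso_only = elastic_net_penalty(beta, lam1=0.5, lam2=0.0)  # pure LASSO penalty
ridge_only = elastic_net_penalty(beta, lam1=0.0, lam2=0.5)  # pure Ridge penalty
print(lasso_only, ridge_only)
```

Setting one of the two tuning parameters to zero leaves only the other penalty, which is the LASSO or Ridge special case mentioned above.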

Thus, the estimates of the coefficients can be computed by solving the following minimization problem

$$\hat{\beta} = \arg\min_{\beta}\Bigl\{\sum_{i=1}^{n}\bigl(y_i - \hat{g}(x_i^{T}\beta)\bigr)^2 + \lambda_1 \sum_{j=1}^{p}|\beta_j| + \lambda_2 \sum_{j=1}^{p}\beta_j^2\Bigr\}.$$

The main goal is to find efficient estimators of the vector $\beta$ and the nonparametric link function $g$. As $\beta$ is inside the link function, the challenge is to find a convenient parametric estimator of $\beta$ that reaches the $\sqrt{n}$-consistency rate. Once an estimator of $\beta$ with $\sqrt{n}$-convergence rate is found, we estimate the unknown nonparametric link function. To estimate the coefficient vector, we can use the semiparametric least squares (SLS) method or its weighted version (WSLS) introduced by Ichimura (1993). For the single index model to be identified, it is required that $\|\beta\| = 1$ and that the first component of $\beta$ is positive (Ichimura, 1993).
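The identification constraint can be imposed on any candidate coefficient vector by rescaling it to unit Euclidean norm and flipping its sign if needed. The following minimal sketch illustrates this; the function name `normalize_index` is hypothetical, and the sketch assumes that the first component is non-zero.

```python
import math


def normalize_index(beta):
    """Rescale beta to unit Euclidean norm with a positive first component."""
    norm = math.sqrt(sum(b * b for b in beta))
    if norm == 0.0 or beta[0] == 0.0:
        raise ValueError("beta must be non-zero with a non-zero first component")
    scaled = [b / norm for b in beta]
    # Flip the sign of the whole vector so that the first component is positive.
    if scaled[0] < 0.0:
        scaled = [-b for b in scaled]
    return scaled


beta_id = normalize_index([-3.0, 4.0])
print(beta_id)
```

Because the link function is unknown, rescaling or changing the sign of the coefficient vector can always be absorbed into the link function, which is why such a normalization is needed.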

6.3 Estimation of the Unknown Link Function

As already mentioned, the estimation of a single index model is carried out in two steps: first, the vector of coefficients $\beta$ is estimated, and then the resulting index values $x^{T}\hat{\beta}$ are used to estimate the unknown link function by a univariate nonparametric kernel regression of $y$ on $x^{T}\hat{\beta}$. It should be taken into account that the functional shape of the link function is unknown. Further, since the form of $g$ determines the value of the conditional expectation $E(y \mid x)$ for a given index $x^{T}\beta$, the estimation of the index coefficients has to be adjusted to a specific estimate of the link function in order to yield a correct regression value (Härdle et al., 2004). Thus, in a single index model both the index and the link function have to be estimated, although only the link function is nonparametric.

In this part, we suggest using the support vector regression model to estimate the link function, since it is one of the most popular kernel regression methods in the machine learning community for dealing with nonlinearity and the curse of dimensionality. For this purpose, we consider the support vector regression model with the single index as follows

$$f(x) = w^{T}\varphi(x^{T}\beta) + b,$$

where $w$ is the weight vector, $\varphi(\cdot)$ is a nonlinear function and $b$ is the bias term. According to Vapnik (1995), the function $f$ should be as flat as possible. Flatness of the function can be achieved by minimizing the Euclidean norm $\|w\|^2$.

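The coefficients $w$ and $b$ should be estimated by minimizing the following ε-tube loss function to optimize the generalization ability (prediction risk),

$$R(w, b) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} L_{\epsilon}\bigl(y_i, f(x_i)\bigr), \tag{6.4}$$

$$L_{\epsilon}\bigl(y, f(x)\bigr) = \begin{cases} 0, & |y - f(x)| \le \epsilon, \\ |y - f(x)| - \epsilon, & \text{otherwise} \end{cases} \tag{6.5}$$

(Andreou et al., 2009; Cristianini and Shawe-Taylor, 2000; Vapnik, 1995).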
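The ε-tube loss can be illustrated with a few lines of code. The sketch below uses the standard ε-insensitive loss of Vapnik (1995), under which residuals inside the tube cost nothing and residuals outside it are penalized linearly; the function names and the numbers are illustrative assumptions.

```python
def eps_insensitive_loss(y, f, eps):
    """Zero inside the eps-tube, linear in the excess residual outside it."""
    return max(0.0, abs(y - f) - eps)


def regularized_risk(w_norm_sq, y, fitted, C, eps):
    """Flatness term plus C times the summed eps-insensitive losses."""
    total_loss = sum(eps_insensitive_loss(yi, fi, eps) for yi, fi in zip(y, fitted))
    return 0.5 * w_norm_sq + C * total_loss


inside = eps_insensitive_loss(1.0, 1.1, eps=0.2)   # residual 0.1 lies inside the tube
outside = eps_insensitive_loss(1.0, 2.0, eps=0.2)  # residual 1.0 exceeds the tube by 0.8
print(inside, outside)
```

A larger C makes points outside the tube more costly relative to flatness, while ε = 0 turns the loss into the absolute error.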

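The problem (6.4) is equivalent to the following convex optimization problem

$$\min_{w, b, \xi, \xi^{*}} \; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^{*}) \quad \text{subject to} \quad \begin{cases} y_i - w^{T}\varphi(x_i^{T}\beta) - b \le \epsilon + \xi_i, \\ w^{T}\varphi(x_i^{T}\beta) + b - y_i \le \epsilon + \xi_i^{*}, \\ \xi_i, \xi_i^{*} \ge 0, \end{cases} \tag{6.6}$$

where the parameter C determines the trade-off between the flatness of $f$ and the extent to which deviations larger than ε are tolerated, and $\xi_i$ and $\xi_i^{*}$ are non-negative slack variables that measure the deviations of the training vectors outside the ε-tube.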

The dual formulation is used to solve the optimization problem in (6.6). Setting the partial derivatives of the Lagrangian to zero gives the final Elastic Net Single Index Support Vector Regression model (ENSI-SVR),

$$f(x) = \sum_{i=1}^{n}(\alpha_i - \alpha_i^{*})\,K\bigl(x_i^{T}\hat{\beta}, x^{T}\hat{\beta}\bigr) + b,$$

where $\alpha_i$ and $\alpha_i^{*}$ are the Lagrange multipliers and $K(\cdot,\cdot)$ is the kernel function that transforms the nonlinear relationships in the input space into a linear form in the high-dimensional feature space.
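To make the role of the kernel concrete, the sketch below evaluates a kernel expansion on one-dimensional index values. It assumes a Gaussian RBF kernel of the form exp(-(u - v)^2 / (2 h^2)); the exact parametrization of the kernel parameter h used in the thesis is not stated, so this form, the function names and the numbers are assumptions for illustration.

```python
import math


def rbf_kernel(u, v, h):
    """Gaussian RBF kernel on scalar index values (assumed parametrization)."""
    return math.exp(-((u - v) ** 2) / (2.0 * h ** 2))


def svr_predict(u_new, support_index, dual_coef, bias, h):
    """Evaluate bias + sum_i dual_coef[i] * K(u_i, u_new).

    `dual_coef[i]` stands for the difference of the two Lagrange multipliers
    of observation i; observations with a zero coefficient can be dropped,
    which is the sparsity property of the dual solution.
    """
    return bias + sum(c * rbf_kernel(u, u_new, h) for u, c in zip(support_index, dual_coef))


prediction = svr_predict(0.0, support_index=[0.0, 1.0], dual_coef=[0.5, -0.5], bias=0.1, h=1.0)
print(prediction)
```

Since the kernel is applied to the scalar index rather than to the full predictor vector, the nonparametric step is one-dimensional regardless of the number of predictors.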

The estimation procedure of the proposed ENSI-SVR method can be summarized in the following steps.

Step 1: Estimate the vector of coefficients using the Elastic Net variable selection method in order to obtain a full rank predictor matrix.

Step 2: Use the full rank predictor matrix from Step 1 to estimate the vector of coefficients by the SLS method.

Step 3: Create the observations of the new independent variable (the single index values).

Step 4: Use the SVR technique to estimate the link function of the single index values.
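Steps 1 and 3 above can be sketched in a few lines. The sketch uses a plain coordinate-descent elastic net for a linear fit, which minimizes (1/(2n)) RSS + lam1 * ||beta||_1 + (lam2/2) * ||beta||_2^2; this scaling, the toy data and all names are assumptions made for illustration and are not the thesis implementation. Steps 2 and 4 (SLS and SVR) are not implemented here; the normalized elastic net coefficients merely stand in for the SLS estimate when the index is formed.

```python
import math


def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO part of the penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_cd(X, y, lam1, lam2, n_sweeps=100):
    """Coordinate descent for the linear elastic net (assumed scaling, see lead-in)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - X beta for beta = 0
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z + lam2 == 0.0:
                continue  # an all-zero column carries no information
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, resid)) / n
            new_bj = soft_threshold(rho, lam1) / (z + lam2)
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid = [r - v * delta for r, v in zip(resid, col)]
                beta[j] = new_bj
    return beta


# Toy data: the third predictor has only a very weak effect.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [3.05, 0.95, -1.05, -2.95]

# Step 1: variable selection with the elastic net.
beta = elastic_net_cd(X, y, lam1=0.1, lam2=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]

# Stand-in for Step 2: normalize the retained coefficients to unit norm
# (the first retained coefficient is already positive in this toy example).
norm = math.sqrt(sum(beta[j] ** 2 for j in selected))
beta_unit = {j: beta[j] / norm for j in selected}

# Step 3: single index values built from the retained predictors.
index = [sum(row[j] * beta_unit[j] for j in selected) for row in X]
print(selected, index)
```

In this toy example the weak third predictor is removed by the penalty, so the index is built from the first two predictors only.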

6.4 Simulation Examples

In this section, three simulation examples are presented to compare the proposed ENSI-SVR method with the standard SVR in the case of high dimensionality, especially when p is larger than n. The purpose of these simulations is to show the effectiveness of the proposed ENSI-SVR method over the standard SVR technique in terms of dimension reduction and prediction accuracy. The first example combines the problems of nonlinearity and high dimensionality (Hu et al., 2013; Wu et al., 2010). The next two examples combine the same problems with the additional problem that p is larger than n (Peng and Huang, 2011). The methods are evaluated using the mean squared error (MSE) on the testing data. The prediction risk (MSE) is averaged over 100 replications of random data sets. Each data set is divided into two groups: 70% of the observations form the training sample, which is used to build the model, and 30% form the testing sample, which is used to test the model. The kernel function used in this section is the radial basis function (RBF), and all calculations are implemented in R.
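The evaluation protocol described above (a 70/30 split and the testing MSE averaged over replications) can be sketched generically. The data generator and the predictor below are deliberately trivial placeholders and are not the models used in the thesis; all names are hypothetical.

```python
import random
import statistics


def holdout_mse(y_true, y_pred):
    """Mean squared error on the testing sample."""
    return statistics.fmean([(a - b) ** 2 for a, b in zip(y_true, y_pred)])


def average_risk(simulate, fit_predict, n, replications=100, train_fraction=0.7, seed=1):
    """Average the testing MSE over independently simulated data sets."""
    rng = random.Random(seed)
    risks = []
    for _ in range(replications):
        X, y = simulate(n, rng)
        idx = list(range(n))
        rng.shuffle(idx)
        cut = round(train_fraction * n)
        train, test = idx[:cut], idx[cut:]
        pred = fit_predict([X[i] for i in train], [y[i] for i in train],
                           [X[i] for i in test])
        risks.append(holdout_mse([y[i] for i in test], pred))
    return statistics.fmean(risks)


def simulate_constant(n, rng):
    """Placeholder generator: one uniform predictor and a constant response."""
    return [[rng.random()] for _ in range(n)], [2.0] * n


def mean_predictor(X_train, y_train, X_test):
    """Placeholder method: predict the training mean for every test point."""
    return [statistics.fmean(y_train)] * len(X_test)


risk = average_risk(simulate_constant, mean_predictor, n=50)
print(risk)
```

Any pair of methods can be compared under identical conditions by passing each as `fit_predict` with the same seed, so that both see the same simulated data sets and splits.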

6.4.1 Simulation I

In this example, we consider the following single-index model with 20 predictor variables

where with , the set of predictors is generated from the uniform distribution on [0, 1], while the additive error follows the exponential distribution with parameter 1/2. The models are built using three sample sizes, n = 50, 100 and 200, with 100 replications.

The results of applying SVR and ENSI-SVR are summarized in Table 6.1, which reports the prediction risk (MSE) of both methods for two values (small and large) of each of the parameters C, ε and h. All results point to the superiority of the proposed method over the SVR method for all sample sizes and all combinations of the parameters. According to Figure 6.1, the proposed ENSI-SVR achieved lower MSE values than the SVR method for all sample sizes and parameter values. The prediction risk curve is nearly flat for the small sample size (n=50), while some ripples appear as the sample size increases. However, these ripples do not affect the overall results of the proposed method.
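Two values for each of the three free parameters give eight parameter combinations per sample size. A short sketch of this grid follows; the ordering of the combinations is an assumption and need not match the ordering used in the figures.

```python
from itertools import product

h_values = (1, 5)         # kernel parameter (small, large)
eps_values = (0.0, 0.2)   # width of the eps-tube (small, large)
C_values = (1, 100)       # regularization constant (small, large)

# One dictionary per parameter combination, eight in total.
grid = [{"h": h, "eps": eps, "C": C}
        for h, eps, C in product(h_values, eps_values, C_values)]

for number, combination in enumerate(grid, start=1):
    print(number, combination)
```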

Figure 6.1: The MSE of SVR and ENSI-SVR for 20 predictors

Table 6.1: The MSE of SVR and ENSI-SVR methods for 20 predictors

n     h     SVR                               ENSI-SVR
            ε=0             ε=0.2             ε=0             ε=0.2
            C=1     C=100   C=1     C=100     C=1     C=100   C=1     C=100
50    1     5.6617  5.5174  5.6398  5.5191    4.0205  4.0144  4.0036  3.8498
50    5     5.7493  5.5955  5.7259  5.5990    4.0349  4.1925  4.0275  3.8417
100   1     5.7606  5.2534  5.7718  5.2670    4.4500  4.7066  4.4362  4.5639
100   5     6.3634  6.3064  6.3646  6.3106    4.4353  4.7992  4.4002  4.5484
200   1     5.2930  4.9759  5.2872  4.9668    4.6369  4.8669  4.6344  4.8026
200   5     6.1728  6.0852  6.1648  6.0861    4.5953  4.8190  4.5716  4.7357

6.4.2 Simulation II

In the next example, we simulated 100 data sets consisting of 40 predictors with two sample sizes (n=30 and 100), based on the following single index model. The first sample size gives a less than full rank design, since p is larger than n.

where with , the set of predictors and the additive error are independent and identically distributed N(0,1).

The prediction risks of the compared methods (SVR and ENSI-SVR) are presented in Table 6.2. Two values (small and large) of each of the parameters C, ε and h have been used to compute these MSE values. Figure 6.2 illustrates the results graphically for the two sample sizes. According to Table 6.2 and Figure 6.2, the proposed method achieved lower MSE values than the SVR method for all parameter values in both the full rank and the less than full rank cases. Further, the ratio of the MSE of the SVR method to that of the proposed ENSI-SVR exceeds 10 in most cases. All this reflects the superiority of the proposed method over the SVR method.
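The ratio statement can be checked directly from the entries of Table 6.2. The short sketch below divides each SVR entry by the corresponding ENSI-SVR entry; the values are copied from the table, keyed by (n, h), and the variable names are illustrative.

```python
# MSE values from Table 6.2, keyed by (n, h); within each list the order is
# (eps=0, C=1), (eps=0, C=100), (eps=0.2, C=1), (eps=0.2, C=100).
svr = {
    (30, 1): [0.429, 0.427, 0.423, 0.423],
    (30, 5): [0.429, 0.427, 0.424, 0.423],
    (100, 1): [0.3996, 0.3982, 0.3960, 0.3967],
    (100, 5): [0.3996, 0.3986, 0.3968, 0.3967],
}
ensi_svr = {
    (30, 1): [0.027, 0.064, 0.039, 0.038],
    (30, 5): [0.059, 0.122, 0.087, 0.087],
    (100, 1): [0.0136, 0.0195, 0.0185, 0.0174],
    (100, 5): [0.0205, 0.0331, 0.0290, 0.0299],
}

ratios = {key: [s / e for s, e in zip(svr[key], ensi_svr[key])] for key in svr}
n_cases = sum(len(values) for values in ratios.values())
n_above_10 = sum(r > 10 for values in ratios.values() for r in values)
smallest = min(r for values in ratios.values() for r in values)

print(n_above_10, "of", n_cases, "ratios exceed 10; the smallest ratio is", round(smallest, 2))
```

The ratio is largest for n=100, where every parameter combination gives a ratio above 10, and smallest for n=30 with h=5.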

Table 6.2: The MSE of SVR and ENSI-SVR methods for 40 predictors

n     h     SVR                               ENSI-SVR
            ε=0             ε=0.2             ε=0             ε=0.2
            C=1     C=100   C=1     C=100     C=1     C=100   C=1     C=100
30    1     0.429   0.427   0.423   0.423     0.027   0.064   0.039   0.038
30    5     0.429   0.427   0.424   0.423     0.059   0.122   0.087   0.087
100   1     0.3996  0.3982  0.3960  0.3967    0.0136  0.0195  0.0185  0.0174
100   5     0.3996  0.3986  0.3968  0.3967    0.0205  0.0331  0.0290  0.0299

Figure 6.2: The MSE of SVR and ENSI-SVR for 40 predictors (two panels, n=30 and n=100; horizontal axis: combination of SVR parameters, 1 to 8; series: SVR and ENSI-SVR)

6.4.3 Simulation III

This example is the same as the previous one (Simulation II), except that the data sets consist of 50 predictors with two sample sizes (n=40 and 100). The number of predictors is again larger than the sample size for the first data set. We let with, the set of predictors and the error term are sampled from the standard normal distribution.

The comparison results of ENSI-SVR and SVR are displayed in Table 6.3 and Figure 6.3. The results of this example do not differ substantially from those of the previous example, even though the number of predictors is 50 instead of 40, which means that the number of noise variables has increased by 10. Despite this increase in the number of noise variables, the ENSI-SVR method has maintained low levels of prediction error, which indicates the importance of the variable selection step in the model. In general, the results point to the importance of the proposed method, especially when p is larger than n.

Table 6.3: The MSE of SVR and ENSI-SVR methods for 50 predictors

n     h     SVR                               ENSI-SVR
            ε=0             ε=0.2             ε=0             ε=0.2
            C=1     C=100   C=1     C=100     C=1     C=100   C=1     C=100
40    1     0.490   0.487   0.483   0.483     0.018   0.031   0.0314  0.031
40    5     0.490   0.487   0.483   0.483     0.036   0.052   0.0660  0.065
100   1     0.3998  0.3981  0.3978  0.3977    0.0125  0.0178  0.0166  0.0179
100   5     0.3993  0.3984  0.3978  0.3977    0.0178  0.0287  0.02694 0.0262

Figure 6.3: The MSE of SVR and ENSI-SVR for 50 predictors (two panels, n=40 and n=100; horizontal axis: combination of SVR parameters, 1 to 8; series: SVR and ENSI-SVR)

6.5 Real Case Studies

In this part, the compared methods, ENSI-SVR and SVR, are illustrated through an analysis of the Body dimensions data (Heinz et al., 2003) and the near infrared spectroscopy data (Liebmann et al., 2009). These data sets are available in R, in the package “Brq” under the name “Body” and in the package “chemometrics” under the name “NIR”, respectively. The mean squared error on the testing data (MSE) is used to evaluate the methods. The radial basis function (RBF) kernel is used to transform the inputs into a high-dimensional feature space. Each data set has been split into 75% training samples and 25% testing samples.

6.5.1 Body Dimensions Data

In this part, the Body dimensions data are used to evaluate the proposed method (ENSI-SVR) on real data. The measurements were originally taken by Heinz et al. (2003). The data set consists of 21 predictors (p=21) in addition to four further measurements (age, gender, height and weight) that can serve as dependent variables, recorded on 507 individuals (n=507). In this chapter, we fit the weight with all 21 explanatory variables: biacromial diameter (BiacSk), biiliac diameter (BiilSk), bitrochanteric diameter (BitrSk), chest depth between spine and sternum at nipple level (CheDeSk), chest diameter at nipple level (CheDiSk), elbow diameter (ElbowSk), wrist diameter (WristSk), knee diameter (KneeSk), ankle diameter (AnkleSk), shoulder girth over deltoid muscles (ShoulGi), chest girth (ChestGi), waist girth (WaistGi), navel or abdominal girth (NavelGi), hip girth at level of bitrochanteric diameter (HipGi), thigh girth below gluteal fold (ThighGi), bicep girth (BicepGi), forearm girth (ForeaGi), knee girth over patella (KneeGi), calf maximum girth (CalfGi), ankle minimum girth (AnkleGi) and wrist minimum girth (WristGi). All of these variables (inputs as well as output) are standardized.

Table 6.4: The MSE of SVR and ENSI-SVR methods for Body dimensions data

Parameters      SVR                       ENSI-SVR
h      ε        C=1     C=50    C=100     C=1     C=50    C=100
0.5    0.0      0.5833  0.5581  0.5581    0.0870  0.0552  0.0526
0.5    0.1      0.6017  0.5816  0.5817    0.0854  0.0550  0.0521
0.5    0.2      0.6272  0.6108  0.6109    0.0923  0.0540  0.0525
1      0.0      0.9913  0.9749  0.9750    0.1234  0.0598  0.0647
1      0.1      0.9970  0.9851  0.9851    0.1220  0.0757  0.0750
1      0.2      1.0042  0.9954  0.9955    0.1254  0.0824  0.0741
5      0.0      1.0972  1.0946  1.0946    0.2288  0.2234  0.2770
5      0.1      1.0962  1.0946  1.0947    0.2321  0.2147  0.2486
5      0.2      1.0955  1.0947  1.0948    0.2360  0.1871  0.1927

Table 6.4 and Figure 6.4 summarize the results of applying SVR and ENSI-SVR for a set of parameters (C, ε and h), with three values for each parameter (small, moderate and large). According to Figure 6.4, there are some sharp jumps in the MSE curve, especially for the SVR method, across the various values of the parameters, whether small, moderate or large, while the MSE curve of the ENSI-SVR method is smoother. Moreover, the proposed method achieved low levels of MSE, close to zero, for the different values of the parameters, except when the kernel parameter is large (h=5), whereas the SVR method achieved high levels of MSE for all parameter values, which again reflects the superiority of the proposed ENSI-SVR method.

Figure 6.4: The MSE of SVR and ENSI-SVR for Body dimensions data

6.5.2 The NIR data

In this subsection, the near infrared spectroscopy (NIR) data are used to evaluate the proposed method (ENSI-SVR) in the case of rank deficient data. The data consist of 235 variables (p=235) containing the first derivatives of near infrared spectroscopy absorbance values for 166 alcoholic fermentation mashes (n=166) of various feedstock (wheat, rye and corn), and two output variables containing the concentrations of ethanol and glucose. In this example, we have used ethanol as the dependent variable (Y) with all 235 input variables (X).

The results of applying SVR and ENSI-SVR are summarized in Table 6.5 and Figure 6.5 for combinations of the parameters (C, ε and h), with three values for each parameter. According to these results, the proposed method achieved low levels of MSE for different values of the parameters, whereas the SVR method achieved high levels of MSE for all parameter values, which reflects the superiority of the proposed ENSI-SVR method.

Table 6.5: The MSE of SVR and ENSI-SVR methods for NIR data

Parameters      SVR                       ENSI-SVR
h      ε        C=1     C=50    C=100     C=1     C=50    C=100
0.5    0.0      653     517     397       460     62.2    62.8
0.5    0.1      654     518     396       461     63.2    62.8
0.5    0.2      653     517     396       460     63.1    63
1      0.0      651     397     198       514     75.5    75.3
1      0.1      650     398     199       515     75.8    75.6
1      0.2      651     397     197       513     76.2    75.9
5      0.0      629     144     144       596     68.3    37.1
5      0.1      626     143     144       595     69.1    36.9
5      0.2      624     144     145       594     70.1    36.6

Figure 6.5: The MSE of SVR and ENSI-SVR for NIR data

6.6 Discussion and Conclusion

In this chapter, we have proposed variable selection for the single-index model (SIM) in order to achieve dimension reduction. The SIM consists of two parts, parametric and nonparametric, which allows it to combine high accuracy with flexibility. The key to the success of our proposed method is the use of the Elastic Net penalty to obtain the significant parameters, which helps to get rid of redundant (noise) variables. Further, without this penalty technique, the SIM cannot be applied when the number of predictor variables p is larger than the sample size n. The unknown link function of the SIM is estimated using support vector regression. The proposed method, ENSI-SVR, is compared with the fully nonparametric SVR method using three simulation examples and two real data sets. Moreover, the same combinations of free parameters and sample sizes have been used in order to provide the same conditions for comparison. Finally, the comparative results showed the superiority of the proposed ENSI-SVR method over the existing SVR method in disposing of redundant variables and reducing the curse of high dimensionality.

CHAPTER 7

SUMMARY, CONCLUSIONS AND RECOMMENDATIONS FOR FURTHER STUDIES

7.1 Introduction

This thesis focuses on linear and nonlinear regression models with low- and high-dimensional data in the presence of outlying observations. This chapter presents the contributions of the study, its conclusions and some recommendations for future studies. The chapter is arranged as follows: Section 7.2 summarizes the contributions of the study and its significant findings. Section 7.3 gives a brief conclusion on the research objectives and the main issues. Finally, possible topics for further research in this area are offered in Section 7.4.

7.2 Research Contributions

The overall conclusions of the current study are summarized based on the main contributions in the following subsections.

7.2.1 FP-SVR for Multiple Outliers and Bad Leverage Points Diagnostic in Linear and Non-Linear Regression Models

Jordaan and Smits (2004) suggested the standard support vector regression for outlier detection (SSVR) because of its significant advantages over the classical methods, such as the ease of adjusting its parameters and the robustness provided by its insensitivity function. This technique is applied to linear and nonlinear functions with multidimensional input.

Nishiguchi et al. (2010) stated that the standard SVR approach is difficult to put into practical use because of its high computational cost and the difficulty of adjusting its parameters. This is due to the inefficient way of adjusting the initial SVR parameters, which leads to high computational costs. Because of these drawbacks, they proposed a practical new technique (μ-ε-SVR) to improve the performance of SSVR in detecting vertical outliers and leverage points in regression models. This technique detects one outlier per iteration, so the computational cost increases with each additional outlier, which makes it suitable only for small levels of contamination. In general, both of these methods suffer from some drawbacks. First, they are not very successful in identifying outliers and leverage points. Second, they have high computational costs, especially for high contamination levels. Third, the adjustment of the parameters does not follow a clear rule. Finally, they are not appropriate for non-expert users. To remedy these drawbacks, we have proposed practical techniques, named fixed parameters support vector regression (FP-SVR), to detect vertical outliers and bad leverage points in linear and nonlinear regression models for both full rank and rank deficient data. These techniques take advantage of the robustness of SVR, the ease of controlling the SVR parameters and an appropriate type of transformation. The results of the real applications and simulation studies showed that the proposed methods have advantages over earlier SVR methods because they minimize the computational cost and use a fixed set of parameters, which makes them suitable for non-expert users. The details and the results of FP-SVR are presented in Chapter 3.

7.2.2 Modified GM-Estimator Based on FP-SVR for Data having Vertical Outliers and Bad Leverage Points

In robust regression analysis, it is important to realize that only vertical outliers and bad leverage observations have large effects on the estimates of the regression coefficients, while good leverage points have no significant effect. To the best of our knowledge, not much work has focused on developing an estimation method that takes into account the difference between good and bad leverage observations. In this study, we proposed a new estimation technique, called the modified GM-estimator based on support vector regression (GM-SVR), to overcome the problem of outliers and bad leverage points in data sets. The empirical examples and simulation studies showed that the proposed method achieves a high breakdown point, high efficiency and a bounded influence function. The results presented in Chapter 4 show that the GM-SVR performs best overall, followed by GM6, for all combinations of sample size and percentage of contamination. In contrast, the OLS has the worst performance, with the highest standard deviation of the coefficients and the lowest model efficiency.
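The sensitivity of OLS noted above can be illustrated with the copper content data of Appendix A1, in which the last observation (Y = 28.95) lies far from the rest. The sketch below is a plain least squares fit written for illustration only. The helper name ols_fit and the choice to drop observation 24 are assumptions made here, and the code does not implement GM-SVR or any other method proposed in this thesis.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the simple least squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Copper content data as tabulated in Appendix A1 (X1 = 1, ..., 24).
Y = [2.20, 2.20, 2.40, 2.40, 2.50, 2.70, 2.80, 2.90, 3.03, 3.03,
     3.10, 3.37, 3.40, 3.40, 3.40, 3.50, 3.60, 3.70, 3.70, 3.70,
     3.70, 3.77, 5.28, 28.95]
X = list(range(1, 25))

# Fit with all 24 observations.
_, slope_all = ols_fit(X, Y)
# Dropping observation 24 is an illustrative choice, not the thesis procedure.
_, slope_clean = ols_fit(X[:-1], Y[:-1])

if __name__ == "__main__":
    print(f"slope with all 24 observations: {slope_all:.3f}")
    print(f"slope without observation 24:   {slope_clean:.3f}")
```

Under these assumptions, the single extreme observation changes the fitted slope by more than a factor of two, which is the kind of behaviour that a bounded influence estimator is designed to prevent.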

7.2.3 New SIM to Remedy the Problem of High Dimensionality in Linear and Non-Linear Regression Models

The ordinary least squares (OLS) estimator has difficulty satisfying the Gauss-Markov assumptions in the case of high dimensional data and does not adapt easily to such data, regardless of whether the relationship among the variables is linear or nonlinear. As an alternative remedy, the support vector regression (SVR) has been proposed to overcome this problem. However, there is no guarantee that the SVR performs well when the threshold value is small (near zero). To overcome this shortcoming, we proposed combining the SVR with the single index model (SIM), which is a popular approach for handling the high dimensionality problem. To assess the performance of the proposed method, SI-SVR, it was compared with the standard SVR in terms of the mean squared error (MSE) using empirical and simulation examples. The real and simulation studies showed that the proposed method is able to address the problem of high dimensionality for different combinations of the parameters. In general, the comparison results presented in Chapter 5 demonstrate the superiority of the proposed SI-SVR over the standard SVR in reducing the problem of high dimensionality for all combinations of the parameters, for both linear and nonlinear target functions.

7.2.4 Elastic Net with SIM for Reducing Dimensionality when p is Larger than n for Linear and Non-Linear Regression Models

In Chapter 5, we proposed combining the SVR with the single index model (SIM), which is one of the most common techniques for addressing the problem of high dimensionality. This technique was proposed to avoid the case in which the threshold value ε of the SVR model is small. However, it faces a significant challenge in applications where the number of predictors p is much greater than the number of samples n. It is well known that the SIM is a semi-parametric model that consists of two parts, a parametric part and a non-parametric part. The parametric part cannot be implemented when the matrix of predictors is singular. This required us to develop a novel technique that is able to overcome this shortcoming. Therefore, we proposed the elastic net for single index support vector regression model (ENSI-SVR) to achieve dimension reduction. The comparison results obtained through Monte Carlo simulation experiments and a real case study show that the ENSI-SVR is an efficient method for dealing with sparse data and achieving the dimension reduction that allows the SIM to be applied easily. The ENSI-SVR also performs very well in comparison with the standard SVR, achieving the smallest MSE values for different combinations of the parameters for both linear and nonlinear target functions.
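For reference, the ingredients named above can be written in their standard textbook forms. The notation (g, β, λ1, λ2, X) is an assumption made here for illustration and may differ from the notation used elsewhere in this thesis. The first line is the single index model, with parametric index x'β and non-parametric link g. The second line is the (naive) elastic net criterion of Zou and Hastie (2005). The third line records why the predictor cross-product matrix is singular when p exceeds n.

```latex
\begin{align}
  y_i &= g\!\left(x_i^{\top}\beta\right) + \varepsilon_i,
      \qquad i = 1,\dots,n, \\
  \hat{\beta}_{\mathrm{EN}} &= \arg\min_{\beta}\;
      \lVert y - X\beta \rVert_2^2
      + \lambda_1 \lVert \beta \rVert_1
      + \lambda_2 \lVert \beta \rVert_2^2, \\
  \operatorname{rank}\!\left(X^{\top}X\right) &\le \min(n, p) = n < p
      \quad \text{when } p > n .
\end{align}
```

The L1 term sets some coefficients exactly to zero, which is the variable selection step that reduces the number of active predictors before the index is estimated.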

7.3 Conclusion

In this thesis, we have investigated and discussed the development of diagnostic measures for the identification of vertical outliers and bad leverage points in multiple linear and nonlinear regression models. Another main purpose of this study is to develop a robust estimation method to remedy the problem of vertical outliers and bad leverage observations in linear regression models. This thesis also discusses the formulation of nonparametric techniques for linear and nonlinear regression models that deal with the problem of high dimensionality by achieving dimension reduction.

To achieve these objectives, various simulation studies have been designed with different characteristics, such as the sample size, the percentage of contamination, the number of independent variables and the type of relationship among the variables. To evaluate the proposed methods, they have been compared with some of the common existing methods using several comparison criteria, such as the SE, the loss function, the VAR, the MSE and the efficiency. Four main objectives have been achieved in this thesis, which can be summarized as follows.

The first objective is to propose new, more efficient diagnostic methods, called the fixed parameters support vector regression (FP-SVR), for the identification of multiple outliers and bad leverage points (BLP), in order to identify their exact number while reducing the masking and swamping effects.

The second objective of this study is to develop a robust estimation method for linear regression models. In this respect, we proposed the modified GM-estimator based on FP-SVR, denoted as GM-SVR. The GM-SVR aims to improve the efficiency of the GM-estimator by using the high breakdown point LTS-estimator as the initial estimate and by utilizing a more effective weight function based on the FP-SVR, which makes it more resistant to outliers and bad leverage observations. The merit of the proposed GM-SVR is its ability to detect outliers and bad leverage points. Thus, it can minimize the weights of bad leverage points and give fair weights to good leverage points, since outliers and bad leverage points are the most responsible for model failure.

The third objective is to propose a new estimation method for linear and nonlinear models, denoted as SI-SVR, to remedy the problem of high dimensionality. The proposed technique employs the single index model to reduce the dimensionality without loss of information. The results of the study indicate that the proposed SI-SVR method achieves dimension reduction efficiently compared with the existing standard SVR model.

Finally, the fourth objective is to modify the SI-SVR presented in the previous objective to address the problem of high dimensionality when the number of predictors p is greater than the sample size n, since the SI-SVR cannot be implemented for a singular predictor matrix. The proposed technique, denoted as ENSI-SVR, employs the variable selection concept to allow the SI-SVR to be applied to linear and nonlinear models. Through the empirical and simulation studies, we found that the suggested method performs well in remedying the problem of high dimensionality in comparison with the standard SVR model.

7.4 Areas of Future Studies

It is well known that the problems of outliers and high dimensionality have serious consequences in linear and nonlinear regression models. In the literature, a number of studies have attempted to overcome these problems separately. However, not many works deal with the combined problems of outliers and high dimensionality, including rank deficient data. In addition, most existing robust methods address the presence of outliers in a data set without classifying them into vertical outliers, good leverage points and bad leverage points. Based on the current study, the following topics are suggested for future research.

In Chapter 4, we proposed a modified GM-estimator for linear models to handle the presence of outliers in the data set. In a similar way, a new robust method can be developed for nonlinear models.

Variable selection for the single index support vector regression model in the case of high dimensional correlated data (the problem of multicollinearity).

REFERENCES

Algamal, Z. Y., & Lee, M. H. (2015). Adjusted adaptive lasso in high-dimensional Poisson regression model. Modern Applied Science, 9(4), 170-177.

Alguraibawi, M., Midi, H., & Imon, A. (2015). A new robust diagnostic plot for classifying good and bad high leverage points in a multiple linear regression model. Mathematical Problems in Engineering, Volume 2015, Article ID 279472, 1-12.

An, C., & Nguyen, T. Q. (2008). Statistical learning based intra prediction in H.264. Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on, 2800-2803.

Andersen, R. (2008). Modern methods for robust regression. Sage. Los Angeles.

Anderson, C., & Schumacker, R. E. (2003). A comparison of five robust regression methods with ordinary least squares regression: Relative efficiency, bias, and test of the null hypothesis. Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences, 2(2), 79-103.

Andreou, P. C., Charalambous, C., & Martzoukos, S. H. (2009). European option pricing by using the support vector regression approach. Artificial Neural Networks–ICANN (pp. 874-883). Springer.

Andrews, D. F. (1974). A robust method for multiple linear regression. Technometrics, 16(4), 523-531.

Armstrong, R. D., & Kung, M. T. (1978). Algorithm AS 132: Least absolute value estimates for a simple linear regression problem. Applied Statistics, 363-366.

Bagheri, A., Midi, H., Ganjali, M., & Eftekhari, S. (2010). A comparison of various influential points diagnostic methods and robust regression approaches: Reanalysis of interstitial lung disease data. Applied Mathematical Sciences, 4(28), 1367-1386.

Bao, Y., Lu, Y., & Zhang, J. (2004). Forecasting stock price by SVMs regression. Artificial intelligence: Methodology, systems, and applications (pp. 295-303). Springer.

Barnett, V., & Lewis, T. (1994). Outliers in statistical data. Wiley, New York.

Beaton, A. E., & Tukey, J. W. (1974). The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2), 147-185.

Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. John Wiley & Sons, New York.

Ben-Hur, A., Ong, C. S., Sonnenburg, S., Schölkopf, B., & Rätsch, G. (2008). Support vector machines and kernels for computational biology. PLoS Comput Biol, 4(10), e1000173.

Bermingham, M., Pong-Wong, R., Spiliopoulou, A., Hayward, C., Rudan, I., Campbell, H., Navarro, P. (2015). Application of high-dimensional feature selection: Evaluation for genomic prediction in man. Scientific Reports, 5(10312), pp 1-12

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144-152.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press, Cambridge.

Calafiore, G. C. (2000). Outliers robustness in multivariate orthogonal regression. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 30(6), 674-679.

Ceperic, V., Gielen, G., & Baric, A. (2014). Sparse ε-tube support vector regression by active learning. Soft Computing, 18(6), 1113-1126.

Chan, W., Chan, C., Cheung, K., & Harris, C. (2001). On the modelling of nonlinear dynamic systems using support vector neural networks. Engineering Applications of Artificial Intelligence, 14(2), 105-113.

Chapelle, O., & Vapnik, V. (1999). Model selection for support vector machines. Thirteenth Annual Neural Information Processing Systems (NIPS), Denver, USA, pp. 230-236.

Chen, D. S., & Jain, R. C. (1994). A robust backpropagation learning algorithm for function approximation. Neural Networks, IEEE Transactions on, 5(3), 467-479.

Cherkassky, V., & Ma, Y. (2004). Practical selection of SVM parameters and noise estimation for SVM regression. Neural Networks, 17(1), 113-126.

Cherkassky, V., & Mulier, F. M. (1998). Learning from data: Concepts, theory, and methods. John Wiley & Sons.

Cherkassky, V., & Mulier, F. M. (2007). Learning from data: Concepts, theory, and methods. John Wiley & Sons.

Chuang, C., Su, S., Jeng, J., & Hsiao, C. (2002). Robust support vector regression networks for function approximation with outliers. Neural Networks, IEEE Transactions on, 13(6), 1322-1330.

Coakley, C. W., & Hettmansperger, T. P. (1993). A bounded influence, high breakdown, efficient regression estimator. Journal of the American Statistical Association, 88(423), 872-880.

Colliez, J., Dufrenois, F., & Hamad, D. (2006). Robust regression and outlier detection with SVR: Application to optic flow estimation. The 17th British Machine Vision Association (BMVC), Edinburgh, UK, pp. 1229-1238.

Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), pp 15-18.

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge.

De Brabanter, K., De Brabanter, J., Suykens, J. A., & De Moor, B. (2010). Optimized fixed-size kernel models for large data sets. Computational Statistics & Data Analysis, 54(6), 1484-1504.

Dhhan, W., Rana, S., & Midi, H. (2015). Non-sparse ε-insensitive support vector regression for outlier detection. Journal of Applied Statistics, 42(8), 1723-1739.

Draper, N. R., Smith, H., & Pownell, E. (1966). Applied regression analysis. Wiley, New York.

Efron, B. (1992). Bootstrap methods: Another look at the jackknife. Springer, New York, pp 569-593.

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32(2), 407-499.

Figueiredo, M. A. (2003). Adaptive sparseness for supervised learning. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(9), 1150-1159.

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.

Frohlich, H., & Zell, A. (2005). Efficient parameter selection for support vector machines in classification and regression via model-based global optimization. Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, 3, 1431-1436.

Gandhi, A. B., Joshi, J. B., Jayaraman, V. K., & Kulkarni, B. D. (2007). Development of support vector regression (SVR)-based correlation for prediction of overall gas hold-up in bubble column reactors for various gas–liquid systems. Chemical Engineering Science, 62(24), 7078-7089.

Gervini, D., & Yohai, V. J. (2002). A class of robust and fully efficient regression estimators. Annals of Statistics, 30(2), 583-616.

Groß, J. (2003). Linear regression. Springer, Heidelberg, Germany.

Guo, B., Gunn, S. R., Damper, R. I., & Nelson, J. D. (2008). Customizing kernel functions for SVM-based hyperspectral image classification. Image Processing, IEEE Transactions on, 17(4), 622-629.

Guo, G., Zhang, J., & Zhang, G. (2010). A method to sparsify the solution of support vector regression. Neural Computing and Applications, 19(1), 115-122.

Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3, 1157-1182.

Hadi, A. S. (1992). A new measure of overall potential influence in linear regression. Computational Statistics & Data Analysis, 14(1), 1-27.

Hampel, F., Ronchetti, E., Rousseeuw, P., & Stahel, W. (1986). Robust statistics. J. Wiley & Sons, New York.

Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346), 383-393.

Härdle, W. K., Hoffmann, L., & Moro, R. (2011). Learning machines supporting bankruptcy prediction. Statistical tools for finance and insurance. Springer, Heidelberg, pp. 225-250.

Härdle, W., Werwatz, A., Müller, M., & Sperlich, S. (2004). Nonparametric and semiparametric models. Springer, New York.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). Unsupervised learning. In The elements of statistical learning. Springer, New York, pp. 485-585.

Hawkins, D. M. (1980). Identification of outliers. Chapman and Hall, London.

Heinz, G., Peterson, L. J., Johnson, R. W., & Kerk, C. J. (2003). Exploring relationships in body dimensions. Journal of Statistics Education, 11(2), pp 225-233.

Hekimoğlu, S., & Erenoglu, R. C. (2013). A new GM-estimate with high breakdown point. Acta Geodaetica Et Geophysica, 48(4), 419-437.

Hill, R. W. (1977). Robust regression when there are outliers in the carriers, unpublished PhD thesis. Harvard University.

Hoerl, A. E., & Kennard, R. W. (1970a). Ridge regression: Applications to nonorthogonal problems. Technometrics, 12(1), 69-82.

Hoerl, A. E., & Kennard, R. W. (1970b). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67.

Horowitz, J. L. (2009). Semiparametric and nonparametric methods in econometrics. Springer, New York.

Hsu, C., Chang, C., & Lin, C. (2003). A practical guide to support vector classification. Technical report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei.

Hu, Y., Gramacy, R. B., & Lian, H. (2013). Bayesian quantile regression for single-index models. Statistics and Computing, 23(4), 437-454.

Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1), pp 73-101.

Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1(5), pp 799-821.

Huber, P. J. (2011). Robust statistics. Springer, New York.

Hubert, M., Rousseeuw, P. J., & Van Aelst, S. (2008). High-breakdown robust multivariate methods. Statistical Science, 23(1), 92-119.

Ichimura, H. (1993). Semiparametric least squares (SLS) and weighted SLS estimation of single-index models. Journal of Econometrics, 58(1), 71-120.

Jackson, D. A., & Chen, Y. (2004). Robust principal component analysis and outlier detection with ecological data. Environmetrics, 15(2), 129-139.

Jordaan, E. M., & Smits, G. F. (2004). Robust outlier detection using SVM regression. Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, 3, 2017-2022.

Kamruzzaman, M., & Imon, A. (2002). High leverage point: Another source of multicollinearity. Pakistan Journal of Statistics-All Series-, 18(3), pp 435-448.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001). Improvements to platt’s SMO algorithm for SVM classifier design. Neural Computation, 13(3), 637-649.

Kuo, T., & Yajima, Y. (2010). Ranking and selecting terms for text categorization via SVM discriminate boundary. International Journal of Intelligent Systems, 25(2), 137-154.

Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied linear statistical models. McGraw-Hill Irwin, New York.

Kwok, J. T. (2001). Linear dependency between ε and the input noise in ε-support vector regression. Artificial Neural Networks–ICANN 2001 (pp. 405-410). Springer.

Lahiri, S. K., & Ghanta, K. C. (2009). Support vector regression with parameter tuning assisted by differential evolution technique: Study on pressure drop of slurry flow in pipeline. Korean Journal of Chemical Engineering, 26(5), 1175-1185.

Lee, Y., & Mangasarian, O. L. (2001). SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20(1), 5-22.

Li, X., & Kong, J. (2014). Application of GA–SVM method with parameter optimization for landslide development prediction. Natural Hazards and Earth System Science, 14(3), 525-533.

Liang, W., Zhang, L., & Wang, M. (2011). The chaos differential evolution optimization algorithm and its application to support vector regression machine. Journal of Software, 6(7), 1297-1304.

Liebmann, B., Friedl, A., & Varmuza, K. (2009). Determination of glucose and ethanol in bioethanol production by near infrared spectroscopy and chemometrics. Analytica Chimica Acta, 642(1), 171-178.

Liu, J. N., & Hu, Y. (2013). Support vector regression with kernel Mahalanobis measure for financial forecast. In Time series analysis, modeling and applications. Springer, Heidelberg, pp. 215-227.

Lu, C., Lee, T., & Chiu, C. (2009). Financial time series forecasting using independent component analysis and support vector regression. Decision Support Systems, 47(2), 115-125.

Maronna, R. A., Martin, R. D., & Yohai, V. J. (2006). Robust statistics. John Wiley, Chichester.

Mattera, D., & Haykin, S. (1999). Support vector machines for dynamic reconstruction of a chaotic system. In Advances in Kernel Methods, MIT Press, Cambridge, pp. 211-241.

Mejía-Guevara, I., & Kuri-Morales, Á. (2007). Evolutionary feature and parameter selection in support vector regression. MICAI 2007: Advances in artificial intelligence (pp. 399-408). Springer.

Mickey, M. R., Jean Dunn, O., & Clark, V. (1967). Note on the use of stepwise regression in detecting outliers. Computers and Biomedical Research, 1(2), 105-111.

Montgomery, D. C., Peck, E. A., & Vining, G. G. (2015). Introduction to linear regression analysis. John Wiley & Sons.

Muñoz-Garcia, J., Moreno-Rebollo, J., & Pascual-Acosta, A. (1990). Outliers: A formal approach. International Statistical Review/Revue Internationale De Statistique, 58(3), 215-226.

Nakayama, H., & Yun, Y. (2006). Support vector regression based on goal programming and multi-objective programming. In the 2006 IEEE International Joint Conference on Neural Network Proceedings, Vancouver, BC, Canada.

Nishiguchi, J., Kaseda, C., Nakayama, H., Arakawa, M., & Yun, Y. (2010). Modified support vector regression in outlier detection. Neural Networks (IJCNN), the 2010 International Joint Conference on, 1-5.

Pell, R. J. (2000). Multiple outlier detection for multivariate calibration using robust statistical techniques. Chemometrics and Intelligent Laboratory Systems, 52(1), 87-104.

Peng, H., & Huang, T. (2011). Penalized least squares for single index models. Journal of Statistical Planning and Inference, 141(4), 1362-1379.

Rahmatullah Imon, A. (2005). Identifying multiple influential observations in linear regression. Journal of Applied Statistics, 32(9), 929-946.

Rojo-Álvarez, J. L., Martínez-Ramón, M., Figueiras-Vidal, A. R., García-Armada, A., & Artés-Rodríguez, A. (2003). A robust support vector algorithm for nonparametric spectral analysis. IEEE Signal Processing Letters, 10(11), 320-323.

Rousseeuw, P., & Van Zomeren, B. (1990). Unmasking multivariate outliers and leverage points (with discussion). Journal of the American Statistical Association, 85, 633-651.

Roth, V. (2004). The generalized LASSO. Neural Networks, IEEE Transactions on, 15(1), 16-28.

Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79(388), 871-880.

Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. Mathematical Statistics and Applications, 8, 283-297.

Rousseeuw, P. J., & Croux, C. (1993). Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88(424), 1273-1283.

Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. John Wiley & Sons, New York.

Rousseeuw, P., & Yohai, V. (1984). Robust regression by means of S-estimators. In Robust and nonlinear time series analysis, Springer, Heidelberg, pp. 256-272.

Sato, J. R., Costafreda, S., Morettin, P. A., & Brammer, M. J. (2008). Measuring time series predictability using support vector regression. Communications in Statistics-Simulation and Computation, 37(6), 1183-1197.

Schölkopf, B., Burges, C. J., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. MIT Press, Cambridge.

Schölkopf, B., & Smola, A. (2002). Learning with kernels. MIT Press, Boston.

Shaowu, Z., Lianghong, W., Xiaofang, Y., & Wen, T. (2007). Parameters selection of SVM for function approximation based on differential evolution. International Conference on Intelligent Systems and Knowledge Engineering 2007.

She, Y., & Owen, A. B. (2011). Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494), 626-639.

Simpson, J. R. (1995). New methods and comparative evaluations for robust and biased-robust regression estimation, unpublished PhD thesis, Arizona State University.

Smets, K., Verdonk, B., & Jordaan, E. M. (2007). Evaluation of performance measures for SVR hyperparameter selection. In IEEE 2007 International Joint Conference on Neural Networks, Orlando, Florida, USA.

Smola, A., Murata, N., Schölkopf, B., & Muller, K. (1998). Asymptotically optimal choice of ε-loss for support vector machines. In the ICANN 98. Springer, London, UK.

Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199-222.

Smola, A., & Vapnik, V. (1997). Support vector regression machines. Advances in Neural Information Processing Systems, 9, 155-161.

Smola, A. J., & Schölkopf, B. (1998). A tutorial on support vector regression. NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK.

Stamey, T. A., Kabalin, J. N., McNeal, J. E., Johnstone, I. M., Freiha, F., Redwine, E. A., & Yang, N. (1989). Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. The Journal of Urology, 141(5), 1076-1083.

Stromberg, A. J., Hössjer, O., & Hawkins, D. M. (2000). The least trimmed differences regression estimator and alternatives. Journal of the American Statistical Association, 95(451), 853-864.

Suykens, J. A. (2001). Nonlinear modelling and support vector machines. In the 18th IEEE Instrumentation and Measurement Technology Conference (IMTC) 2001, Budapest, Hungary.

Suykens, J. A., De Brabanter, J., Lukas, L., & Vandewalle, J. (2002). Weighted least squares support vector machines: Robustness and sparse approximation. Neurocomputing, 48(1), 85-105.

Tezcan, J., & Cheng, Q. (2012). Support vector regression for estimating earthquake response spectra. Bulletin of Earthquake Engineering, 10(4), 1205-1219.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.

Tipping, M. E. (2001). Sparse bayesian learning and the relevance vector machine. The Journal of Machine Learning Research, 1, 211-244.

Trevor, H., Robert, T., & Jerome, F. (2001). The elements of statistical learning: Data mining, inference and prediction. New York: Springer-Verlag, 1(8), 371-406.

Ukil, A. (2007). Intelligent systems and signal processing in power engineering, Springer, Heidelberg.

Üstün, B., Melssen, W., Oudenhuijzen, M., & Buydens, L. (2005). Determination of optimal support vector regression parameters by genetic algorithms and simplex optimization. Analytica Chimica Acta, 544(1), 292-305.

Üstün, B. (2003). A comparison of support vector machines and partial least squares regression on spectral data. Department of Analytical Chemistry, Radboud University Nijmegen, Unpublished Master’s Thesis.

Üstün, B., Melssen, W. J., & Buydens, L. M. (2006). Facilitating the application of support vector regression by using a universal pearson VII function based kernel. Chemometrics and Intelligent Laboratory Systems, 81(1), 29-40.

Vanderbei, R. J. (1999). LOQO user’s manual—version 3.10. Optimization Methods and Software, 11(1-4), 485-514.

Vapnik, V. (1995). The nature of statistical learning theory, 1st ed. Springer, New York.

Vapnik, V. (2000). The nature of statistical learning theory, 2nd ed. Springer, New York.

Vapnik, V. N. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), pp 988-999.

Vapnik, V. N. (1998). Statistical learning theory. Wiley, New York.

Vapnik, V., Golowich, S. E., & Smola, A. (1996). Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems, vol 9. MIT Press, pp. 281-287.

Velleman, P. F., & Welsch, R. E. (1981). Efficient computing of regression diagnostics. The American Statistician, 35(4), 234-242.

Wang, W., Xu, Z., Lu, W., & Zhang, X. (2003). Determination of the spread parameter in the Gaussian kernel for classification and regression. Neurocomputing, 55(3), 643-663.

Wang, X., Yang, C., Qin, B., & Gui, W. (2005). Parameter selection of support vector regression based on hybrid optimization algorithm and its application. Journal of Control Theory and Applications, 3(4), 371-376.

Weisberg, S. (2005). Applied linear regression. John Wiley & Sons, Hoboken, New Jersey.

Wilcox, R. R. (2005). Introduction to robust estimation and hypothesis testing. Elsevier Academic Press, New York.

Williams, G. (2011). Data mining with rattle and R: The art of excavating data for knowledge discovery, Springer, New York.

Williams, G., Hawkins, S., Gu, L., Baxter, R., & He, H. (2002). A comparative study of RNN for outlier detection in data mining. Data Mining, IEEE International Conference on IEEE Computer Society, Maebashi, Japan.

Wu, T. Z., Yu, K., & Yu, Y. (2010). Single-index quantile regression. Journal of Multivariate Analysis, 101(7), 1607-1621.

Yang, H., Huang, K., Chan, L., King, I., & Lyu, M. R. (2004). Outliers treatment in support vector regression for financial time series prediction. In the 11th International Conference on Neural Information Processing, ICONIP 2004, Calcutta, India.

Yatchew, A. (2003). Semiparametric regression for the applied econometrician Cambridge University Press, Cambridge.

Yohai, V. J. (1987). High breakdown-point and high efficiency robust estimates for regression. The Annals of Statistics, 15(2), 642-656.

Yohai, V. J., & Zamar, R. H. (1988). High breakdown-point estimates of regression by means of the minimization of an efficient scale. Journal of the American Statistical Association, 83(402), 406-413.

Yu, C., Chen, K., & Yao, W. (2015). Outlier detection and robust mixture modeling using nonconvex penalized likelihood. Journal of Statistical Planning and Inference, 164, 27-38.

Zhou, X., & Ma, Y. (2013). A study on SMO algorithm for solving ε-SVR with non-PSD kernels. Communications in Statistics-Simulation and Computation, 42(10), 2175-2196.

Zhu, G., Liu, S., & Yu, J. (2002). Support vector machine and its applications to function approximation. Journal of East China University of Science and Technology, 28(5), 555-559.

Zong, Q., Liu, W., & Dou, L. (2006). Parameters selection for SVR based on PSO. The Sixth World Congress on Intelligent Control and Automation, WCICA 2006, Dalian, China.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301-320.

APPENDICES

Appendix A1

The Copper Content Data Set

Ind.  X1      Y    Ind.  X1      Y    Ind.  X1      Y
1      1   2.20    11    11   3.10    21    21   3.70
2      2   2.20    12    12   3.37    22    22   3.77
3      3   2.40    13    13   3.40    23    23   5.28
4      4   2.40    14    14   3.40    24    24  28.95
5      5   2.50    15    15   3.40
6      6   2.70    16    16   3.50
7      7   2.80    17    17   3.60
8      8   2.90    18    18   3.70
9      9   3.03    19    19   3.70
10    10   3.03    20    20   3.70

Appendix A2

The Belgium Phone Calls Data Set

Ind.  X1  Y       Ind.  X1  Y       Ind.  X1  Y
1     50  4.4     11    60  13.5    21    70  43.0
2     51  4.7     12    61  14.9    22    71  24.0
3     52  4.7     13    62  16.1    23    72  27.0
4     53  5.9     14    63  21.2    24    73  29.0
5     54  6.6     15    64  119
6     55  7.3     16    65  124
7     56  8.1     17    66  142
8     57  8.8     18    67  159
9     58  10.6    19    68  182
10    59  12.0    20    69  212

Appendix A3

The Hawkins, Bradu and Kass Data Set

Ind.  X1    X2    X3    Y       Ind.  X1    X2    X3    Y       Ind.  X1    X2    X3    Y
1     10.1  19.6  28.3  9.7     26    0.9   3.3   2.5   -0.8    51    2.3   1.5   0.4   0.7
2     9.5   20.5  28.9  10.1    27    3.3   2.5   2.9   -0.7    52    3.3   0.6   1.2   -0.5
3     10.7  20.2  31.0  10.3    28    1.8   0.8   2.0   0.3     53    0.3   0.4   3.3   0.7
4     9.9   21.5  31.7  9.5     29    1.2   0.9   0.8   0.3     54    1.1   3.0   0.3   0.7
5     10.3  21.1  31.1  10.0    30    1.2   0.7   3.4   -0.3    55    0.5   2.4   0.9   0.0
6     10.8  20.4  29.2  10.0    31    3.1   1.4   1.0   0.0     56    1.8   3.2   0.9   0.1
7     10.5  20.9  29.1  10.8    32    0.5   2.4   0.3   -0.4    57    1.8   0.7   0.7   0.7
8     9.9   19.6  28.8  10.3    33    1.5   3.1   1.5   -0.6    58    2.4   3.4   1.5   -0.1
9     9.7   20.7  31.0  9.6     34    0.4   0.0   0.7   -0.7    59    1.6   2.1   3.0   -0.3
10    9.3   19.7  30.3  9.9     35    3.1   2.4   3.0   0.3     60    0.3   1.5   3.3   -0.9
11    11.0  24.0  35.0  -0.2    36    1.1   2.2   2.7   -1.0    61    0.4   3.4   3.0   -0.3
12    12.0  23.0  37.0  -0.4    37    0.1   3.0   2.6   -0.6    62    0.9   0.1   0.3   0.6
13    12.0  26.0  34.0  0.7     38    1.5   1.2   0.2   0.9     63    1.1   2.7   0.2   -0.3
14    11.0  34.0  34.0  0.1     39    2.1   0.0   1.2   -0.7    64    2.8   3.0   2.9   -0.9
15    3.4   2.9   2.1   -0.4    40    0.5   2.0   1.2   -0.5    65    2.0   0.7   2.7   0.6
16    3.1   2.2   0.3   0.6     41    3.4   1.6   2.9   -0.1    66    0.2   1.8   0.8   -0.9
17    0.0   1.6   0.2   -0.2    42    0.3   1.0   2.7   -0.7    67    1.6   2.0   1.2   -0.7
18    2.3   1.6   2.0   0.0     43    0.1   3.3   0.9   0.6     68    0.1   0.0   1.1   0.6
19    0.8   2.9   1.6   0.1     44    1.8   0.5   3.2   -0.7    69    2.0   0.6   0.3   0.2
20    3.1   3.4   2.2   0.4     45    1.9   0.1   0.6   -0.5    70    1.0   2.2   2.9   0.7
21    2.6   2.2   1.9   0.9     46    1.8   0.5   3.0   -0.4    71    2.2   2.5   2.3   0.2
22    0.4   3.2   1.9   0.3     47    3.0   0.1   0.8   -0.9    72    0.6   2.0   1.5   -0.2
23    2.0   2.3   0.8   -0.8    48    3.1   1.6   3.0   0.1     73    0.3   1.7   2.2   0.4
24    1.3   2.3   0.5   0.7     49    3.1   2.5   1.9   0.9     74    0.0   2.2   1.6   -0.9
25    1.0   0.0   0.4   -0.3    50    2.1   2.8   2.9   -0.4    75    0.3   0.4   2.6   0.2

Appendix A4

The First Word-Gesell Data Set

Ind.  X1  Y      Ind.  X1  Y
1     15  95     12    9   96
2     26  71     13    10  83
3     10  83     14    11  84
4     9   91     15    11  102
5     15  102    16    10  100
6     20  87     17    12  105
7     18  93     18    42  57
8     11  100    19    17  121
9     8   104    20    11  86
10    20  94     21    10  100
11    7   113

Appendix A5

The Cloud Point Data Set

Ind.  X1  Y       Ind.  X1  Y
1     0   22.1    11    2   26.1
2     1   24.5    12    4   28.5
3     2   26      13    6   30.3
4     3   26.8    14    8   31.5
5     4   28.2    15    10  33.1
6     5   28.9    16    0   22.8
7     6   30      17    3   27.3
8     7   30.4    18    6   29.8
9     8   31.4    19    9   31.8
10    0   21.9

Appendix A6

The Stack Loss Data Set

Ind.  Air.Flow  Water.Temp  Acid.Conc  Y = stack.loss
1     80        27          89         42
2     80        27          88         37
3     75        25          90         37
4     62        24          87         28
5     62        22          87         18
6     62        23          87         18
7     62        24          93         19
8     62        24          93         20
9     58        23          87         15
10    58        18          80         14
11    58        18          89         14
12    58        17          88         13
13    58        18          82         11
14    58        19          93         12
15    50        18          89         8
16    50        18          86         7
17    50        19          72         8
18    50        19          79         8
19    50        20          80         9
20    56        20          82         15
21    70        20          91         15

Appendix A7

The Aircraft Data Set

Ind.  X1   X2   X3     X4     y
1     6.3  1.7  8176   4500   2.76
2     6.0  1.9  6699   3120   4.76
3     5.9  1.5  9663   6300   8.75
4     3.0  1.2  12837  9800   7.78
5     5.0  1.8  10205  4900   6.18
6     6.3  2.0  14890  6500   9.50
7     5.6  1.6  13836  8920   5.14
8     3.6  1.2  11628  14500  4.76
9     2.0  1.4  15225  14800  16.70
10    2.9  2.3  18691  10900  27.68
11    2.2  1.9  19350  16000  26.64
12    3.9  2.6  20638  16000  13.71
13    4.5  2.0  12843  7800   12.31
14    4.3  9.7  13384  17900  15.73
15    4.0  2.9  13307  10500  13.59
16    3.2  4.3  29855  24500  51.90
17    4.3  4.3  29277  30000  20.78
18    2.4  2.6  24651  24500  29.82
19    2.8  3.7  28539  34000  32.78
20    3.9  3.3  8085   8160   10.12
21    2.8  3.9  30328  35800  27.84
22    1.6  4.1  46172  37000  107.10
23    3.4  2.5  17836  19600  11.19

Appendix B

The Simulation Algorithm

The simulation algorithm used in this thesis can be summarized in the following steps.

Step 1: Simulate the predictor variables, based on the selected distribution such as the standard normal distribution or uniform distribution.

Step 2: Set the parameters of predictors, based on the suggested values in each example.

Step 3: Set the number of replications to R = 1000.

Step 4: Simulate the additive error, based on any suggested distribution such as the standard normal distribution.

Step 5: Calculate the response variable, using the mentioned model in each example.

Step 6: Contaminate the variables with outliers (vertical outliers and bad leverage points) using the suggested values in each example.

Step 7: Apply the proposed method and calculate the comparison criteria.
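The seven steps can be sketched in R as follows. This is only a minimal illustration under stated assumptions: the sample size, coefficient values, contamination sizes, seed and comparison criteria (bias and mean squared error) are hypothetical choices, the predictors are assumed to be redrawn in every replication, and OLS via lm() stands in for the proposed methods, whose codes are given in Appendix C.

```r
set.seed(1)                  # assumed seed, for reproducibility only

n     <- 50                  # assumed sample size
p     <- 2                   # assumed number of predictors
beta  <- c(1, 2, 3)          # Step 2: assumed intercept and slopes
R     <- 1000                # Step 3: number of replications
v.idx <- 1:3                 # assumed rows receiving vertical outliers
l.idx <- 4:5                 # assumed rows receiving bad leverage points

est <- matrix(NA_real_, nrow = R, ncol = p + 1)

for (r in seq_len(R)) {
  x <- matrix(rnorm(n * p), nrow = n)    # Step 1: standard normal predictors
  e <- rnorm(n)                          # Step 4: standard normal errors
  y <- drop(cbind(1, x) %*% beta + e)    # Step 5: assumed linear model

  # Step 6: contamination with assumed shift sizes
  y[v.idx]    <- y[v.idx] + 20           # vertical outliers: shift the response only
  x[l.idx, 1] <- x[l.idx, 1] + 10        # bad leverage points: shift X1, keep y

  est[r, ] <- coef(lm(y ~ x))            # Step 7: OLS as a stand-in method
}

# Step 7 (continued): assumed comparison criteria over the R replications
bias <- colMeans(est) - beta
mse  <- colMeans(sweep(est, 2, beta)^2)
print(rbind(bias = bias, mse = mse))
```

Replacing the lm() call with one of the estimators from Appendix C, and the criteria with those used in each example, would give the corresponding simulation for that example.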

Appendix C

R Programming Codes

#########################################################################
# Outliers Detection (FP-SVR) for RBF kernel function
#########################################################################

rm(list = ls())
library(kernlab)

# x (predictors), y (response) and n (sample size) must be loaded here,
# after the workspace has been cleared.

# Fit an eps-SVR for one row of the parameter grid
wrapper <- function(par) ksvm(y ~ x, type = "eps-svr", C = par[1],
                              epsilon = par[2], kernel = "rbfdot", scaled = FALSE,
                              kpar = list(sigma = par[3]), cross = par[4])

k            <- 5      # k-fold cross validation
cost         <- 10000  # cost parameter
tube.epsilon <- 0      # tube epsilon
hyperpar     <- 1      # kernel parameter
grid <- expand.grid(cost, tube.epsilon, hyperpar, k)

# Keep the fit with the smallest cross-validation error
output <- Reduce(function(x, y) if (x@cross < y@cross) {x} else {y},
                 apply(grid, 1, wrapper))
svm.pred <- predict(output, x)
z <- abs(svm.pred)
print(z)

plot(z, ylab = "abs(pred)", main = " Title ", col = "blue", pch = 1, cex = 1,
     panel.first = grid(col = "gray"), lwd = 2)

# Cut-off point
st <- sqrt(pi * var(z) / (n * 2))
CP <- 2 * median(z) + 2 * (st)

# Print the indices of the observations that exceed the cut-off point
for (i in 1:n) {
  if (z[i] > CP) print(i)
}

abline(h = CP, col = "red", lwd = 3)
identify(x = 1:n, y = z)

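#########################################################################
# Outliers Detection (FP-SVR) for linear kernel function
#########################################################################

rm(list = ls())
library(kernlab)

# x (predictors), y (response) and n (sample size) must be loaded here,
# after the workspace has been cleared.

# Fit an eps-SVR for one row of the parameter grid
# (polynomial kernel of degree 1)
wrapper <- function(par) ksvm(y ~ x, type = "eps-svr", C = par[1],
                              epsilon = par[2], kernel = "polydot", scaled = FALSE,
                              kpar = list(degree = par[3]), cross = par[4])

k            <- 5      # k-fold cross validation
cost         <- 10000  # cost parameter
tube.epsilon <- 0      # tube epsilon
hyperpar     <- 1      # kernel parameter
grid <- expand.grid(cost, tube.epsilon, hyperpar, k)

# Keep the fit with the smallest cross-validation error
output <- Reduce(function(x, y) if (x@cross < y@cross) {x} else {y},
                 apply(grid, 1, wrapper))
svm.pred <- predict(output, x)
z <- abs(y - svm.pred)
print(z)

plot(z, ylab = "abs(pred)", main = " Title ", col = "blue", pch = 1, cex = 1,
     panel.first = grid(col = "gray"), lwd = 2)

# Cut-off point
v <- (pi * var(z) / (n * 2))
CP <- 2 * median(z) + 2 * (v)

# Print the indices of the observations that exceed the cut-off point
for (i in 1:n) {
  if (z[i] > CP) print(i)
}

abline(h = CP, col = "red", lwd = 3)
identify(x = 1:n, y = z)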

#########################################################################
# Modified GM estimator based on FP-SVR
#########################################################################

rm(list = ls())
library(kernlab)
library(robustbase)  # provides ltsReg()

# x (predictors), y (response), n (sample size) and p (number of
# predictors) must be loaded here, after the workspace has been cleared.

# Fit an eps-SVR for one row of the parameter grid
wrapper <- function(par) ksvm(y ~ x, type = "eps-svr", C = par[1],
                              epsilon = par[2], kernel = "rbfdot", scaled = FALSE,
                              kpar = list(sigma = par[3]), cross = par[4])

k            <- 5      # k-fold cross validation
cost         <- 10000  # cost parameter
tube.epsilon <- 0      # tube epsilon
hyperpar     <- 1      # kernel parameter
grid <- expand.grid(cost, tube.epsilon, hyperpar, k)

# Keep the fit with the smallest cross-validation error
output <- Reduce(function(x, y) if (x@cross < y@cross) {x} else {y},
                 apply(grid, 1, wrapper))
svm.pred <- predict(output, x)
zi <- abs(svm.pred)
st <- sqrt(pi * var(zi) / (n * 2))
CP <- c(2 * median(zi) + 2 * (st))

Gmwreg <- function(x, y, iter = 50, bend = 4.685, SEED = TRUE)
{
  xx <- cbind(1, x)
  wsv <- c(CP) / c(zi)
  inw <- c(ifelse(wsv < 1, wsv, 1))      # initial weights: min(1, CP / zi)
  insvr <- ltsReg(y ~ x)                 # LTS fit used as the starting point
  residw <- insvr$residuals
  scale.w <- 1.4826 * (1 + 5 / (n - p - 1)) * median(abs(residw))
  for (it in 1:iter) {
    tw <- abs(residw / (scale.w * inw))
    wtw <- c(ifelse(tw <= bend, (1 - (tw / bend)^2)^2, 0))
    new.w <- lsfit(x, y, wtw)
    if (max(abs(new.w$coef - insvr$coef)) < 0.0001) break
    insvr$coef <- new.w$coef
    residw <- new.w$residuals
  }
  residw <- y - xx %*% new.w$coef
  if (max(abs(new.w$coef - insvr$coef)) >= 0.0001)
    warning(paste("failed to converge in", iter, "steps"))
  list(coef = new.w$coef, residuals = residw, w = wtw)
}

GMW <- Gmwreg(x, y)
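The two weighting steps inside Gmwreg can be inspected in isolation. The following is a small sketch in base R: the values in zi are hypothetical and are not taken from any data set in this thesis, and the helper name bisquare is an assumption introduced for illustration. It computes the cut-off point and the initial weights min(1, CP / zi) in the same way as the code above, and evaluates the weight function used in the iterations for a few scaled residuals.

```r
# Hypothetical absolute values standing in for zi (not thesis data)
zi <- c(0.2, 0.5, 0.4, 0.3, 6.0, 0.6)
n  <- length(zi)

# Cut-off point, computed as in the code above
st <- sqrt(pi * var(zi) / (n * 2))
CP <- 2 * median(zi) + 2 * st

# Initial weights: only observations beyond the cut-off are down-weighted
inw <- pmin(1, CP / zi)

# Weight applied to a scaled residual tw with tuning constant bend
# (the helper name is illustrative)
bisquare <- function(tw, bend = 4.685) {
  ifelse(tw <= bend, (1 - (tw / bend)^2)^2, 0)
}

print(round(inw, 3))
print(bisquare(c(0, 1, 4.685, 6)))
```

In this toy vector only the fifth value lies beyond the cut-off, so it is the only one that receives an initial weight below one.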

#########################################################################
# SI-SVR
#########################################################################

rm(list = ls())
set.seed(12)
library(kernlab)
library(np)

# x (predictor matrix) and y (response) must be loaded here,
# after the workspace has been cleared.

data <- as.matrix(data.frame(y, x))
index <- 1:nrow(data)

# Hold out one quarter of the observations as a test set
testindex <- sample(index, trunc(length(index) / 4))
x.train <- na.omit(data[-testindex, -1])
x.test  <- na.omit(data[testindex, -1])
y.train <- na.omit(data[-testindex, 1])
y.test  <- na.omit(data[testindex, 1])

# Fit an eps-SVR for one row of the parameter grid
wrapper <- function(par) ksvm(y.train ~ x.train, type = "eps-svr", C = par[1],
                              epsilon = par[2], kernel = "rbfdot", scaled = FALSE,
                              kpar = list(sigma = par[3]), cross = par[4])

# c, e and h stand for the selected parameter values
k            <- 5  # k-fold cross validation
cost         <- c  # cost parameter
tube.epsilon <- e  # tube epsilon
hyperpar     <- h  # kernel parameter

grid <- expand.grid(cost, tube.epsilon, hyperpar, k)

# Keep the fit with the smallest cross-validation error
output <- Reduce(function(x, y) if (x@cross < y@cross) {x} else {y},
                 apply(grid, 1, wrapper))
pred.np <- predict(output, x.test)
resid.np <- y.test - (pred.np)
sse.np <- ((resid.np)^2)
mse.np <- mean(sse.np)

###################### SLS

bw <- npindexbw(xdat = x, ydat = y, method = "ichimura")
b <- bw$beta
xb.train <- x.train %*% b

# Fit an eps-SVR on the estimated single index
wrapper <- function(par) ksvm(y.train ~ xb.train, type = "eps-svr", C = par[1],
                              epsilon = par[2], kernel = "rbfdot", scaled = FALSE,
                              kpar = list(sigma = par[3]), cross = par[4])

# c, e and h stand for the selected parameter values
k            <- 5  # k-fold cross validation
cost         <- c  # cost parameter
tube.epsilon <- e  # tube epsilon
hyperpar     <- h  # kernel parameter

grid <- expand.grid(cost, tube.epsilon, hyperpar, k)

# Keep the fit with the smallest cross-validation error
output <- Reduce(function(x, y) if (x@cross < y@cross) {x} else {y},
                 apply(grid, 1, wrapper))
xb.test <- x.test %*% b
pred.sp <- predict(output, xb.test)
resid.sp <- y.test - (pred.sp)
sse.sp <- (resid.sp)^2
mse.sp <- mean(sse.sp)

#########################################################################
# ENSI-SVR
#########################################################################

rm(list = ls())
set.seed(12)
library(kernlab)
library(np)
library(glmnet)

# x (predictor matrix) and y (response) must be loaded here,
# after the workspace has been cleared.

data <- as.matrix(data.frame(y, x))
index <- 1:nrow(data)

# Hold out one quarter of the observations as a test set
testindex <- sample(index, trunc(length(index) / 4))
x.train <- na.omit(data[-testindex, -1])
x.test  <- na.omit(data[testindex, -1])
y.train <- na.omit(data[-testindex, 1])
y.test  <- na.omit(data[testindex, 1])

# Fit an eps-SVR for one row of the parameter grid
wrapper <- function(par) ksvm(y.train ~ x.train, type = "eps-svr", C = par[1],
                              epsilon = par[2], kernel = "rbfdot", scaled = FALSE,
                              kpar = list(sigma = par[3]), cross = par[4])

# c, e and h stand for the selected parameter values
k            <- 5  # k-fold cross validation
cost         <- c  # cost parameter
tube.epsilon <- e  # tube epsilon
hyperpar     <- h  # kernel parameter

grid <- expand.grid(cost, tube.epsilon, hyperpar, k)

# Keep the fit with the smallest cross-validation error
output <- Reduce(function(x, y) if (x@cross < y@cross) {x} else {y},
                 apply(grid, 1, wrapper))
pred.np <- predict(output, x.test)
resid.np <- y.test - (pred.np)
sse.np <- ((resid.np)^2)
mse.np <- mean(sse.np)

###################### ELS. Net… Single

glm.net <- function(x, y) {
  cv.lasso.glm <- cv.glmnet(x, y)
  best.lambda.lasso <- cv.lasso.glm$lambda[which.min(cv.lasso.glm$cvm)]
  lasso <- glmnet(x, y, family = c("gaussian"), alpha = 0.5,
                  lambda = best.lambda.lasso, standardize = TRUE)
  all.betahat.ll <- t(t((lasso$beta)[, 1]))
  names(all.betahat.ll) <- 1:length(all.betahat.ll)
  best.betahat.values.ll <- as.numeric(all.betahat.ll)
  best.betahat.names.ll <- as.numeric(names(all.betahat.ll)[which(lasso$beta != 0)])
  return(best.betahat.values.ll)
}

B <- glm.net(x, y)
names.B <- 1:length(B)
nv <- as.numeric(names.B[which(B != 0)])  # indices of the selected predictors
z <- x[, nv]
z.train <- na.omit(z[-testindex, ])
z.test  <- na.omit(z[testindex, ])

###################### SLS

bw <- npindexbw(xdat = z, ydat = y, method = "ichimura")
b <- bw$beta
zb.train <- z.train %*% b

# Fit an eps-SVR on the estimated single index of the selected predictors
wrapper <- function(par) ksvm(y.train ~ zb.train, type = "eps-svr", C = par[1],
                              epsilon = par[2], kernel = "rbfdot", scaled = FALSE,
                              kpar = list(sigma = par[3]), cross = par[4])

# c, e and h stand for the selected parameter values
k            <- 5  # k-fold cross validation
cost         <- c  # cost parameter
tube.epsilon <- e  # tube epsilon
hyperpar     <- h  # kernel parameter

grid <- expand.grid(cost, tube.epsilon, hyperpar, k)

# Keep the fit with the smallest cross-validation error
output <- Reduce(function(x, y) if (x@cross < y@cross) {x} else {y},
                 apply(grid, 1, wrapper))
zb.test <- z.test %*% b
pred.sp <- predict(output, zb.test)
resid.sp <- y.test - (pred.sp)
sse.sp <- (resid.sp)^2
mse.sp <- mean(sse.sp)

#########################################################################

BIODATA OF STUDENT

Waleed Dhhan Sleabi was born on 18 March 1979 in Babylon, Iraq. He is married and has two children. He received his bachelor's degree in Statistics from the Faculty of Administration and Economics at Al-Mustansiriya University, Baghdad, Iraq, in 2001, and his Master of Statistics from the same faculty in 2010.

He began working as an auditor in the Department of Accountant at Babylon Municipalities, Babylon, Iraq, in 2003. At the beginning of 2008, his role expanded to cover internal control over all activities of Babylon Municipalities.

He joined Universiti Putra Malaysia, Institute of Mathematical Research, in September 2013 to pursue his PhD in Statistics.

LIST OF PUBLICATIONS

Publications in Journals

A High Breakdown, High Efficiency and Bounded Influence Modified GM-estimator Based on Support Vector Regression. Published in Journal of Applied Statistics, (2016). DOI http://dx.doi.org/10.1080/02664763.2016.1182133.

A hybrid technique for selecting support vector regression parameters based on a practical selection method and grid search procedure. Published in Journal of Economic Computation and Economic Cybernetics Studies and Research, (2016), 50(2), pp. 231-246.

Non-Sparse ε-insensitive Support Vector Regression for Outlier Detection. Published in Journal of Applied Statistics (2015), Vol. 42, pp. 1723-1739.

Articles submitted to Journals

Elastic Net for Single Index Support Vector Regression Model (2016), Journal of Statistical Computation and Simulation.

The Single-Index Support Vector Regression Model to Address the Problem of High Dimensionality (2016), Journal of the Korean Statistical Society.

The Support Vector Regression for Outlier Detection (2016), SORT-Statistics and Operations Research Transactions.