Power system fault identification and localization using multiple linear regression of principal component distance indices

Received Jul 25, 2019; Revised Feb 19, 2020; Accepted Mar 3, 2020

This paper focuses on the application of principal component analysis (PCA) to classify and localize power system faults in a three-phase, radial, long transmission line using receiving end line currents, with training data taken for faults almost at the midpoint of the line length. The PCA scores are analyzed to compute the principal component distance index (PCDI), which is further processed using a ratio-based analysis to develop the ratio index matrix (R), the ratio error matrix (RE), and the ratio error index (REI). These are used to develop a fault classifier that produces 100% correct predictions. The latter part of the paper deals with the development of a fault localizer using the same PCDI corresponding to six intermediate training locations, which are analyzed with multiple linear regression (MLR) in order to predict the fault location with an average error of only about 87 m for a 150 km long radial transmission line.


INTRODUCTION
The electrical power transmission system is one of the most spatially extended technical systems. It is directly exposed to the environment and is fairly often subjected to atmospheric hazards that lead to different types of faults. Hence, power system stability, reliability, protection, and regulated power flow have been prime topics of research. Identification, classification, and localization of faults have long been under in-depth research. Prediction of fault location, especially in long transmission systems with high and very high voltage and in large power systems, is one of the most challenging tasks in the development of a robust power system protection algorithm. Hence, prompt detection and classification of faults, along with precise determination of the fault location, has been pursued by researchers in order to ensure system safety and stability. Supervised learning algorithms such as the artificial neural network (ANN) and the probabilistic neural network (PNN) have had a great impact in the area of fault identification and localization [1][2][3][4]. ANN is sometimes combined with other techniques such as fuzzy logic in fault treatment [5]. Wavelet transformation and wavelet entropy have been implemented extensively and successfully in fault analysis [6][7]. Wavelet transformation has often been combined with other methods such as the adaptive neuro-fuzzy inference system [8], genetic algorithm (GA) [9], and principal component analysis (PCA) [10].
Other analytical techniques include support vector machines, which have made a significant contribution to the design of power system protection algorithms [11][12]. Dynamic phasors are another approach used for the analysis of faults in power systems [13]. Principal component analysis (PCA), on the other hand, is a …

… line fault current is collected at a sampling frequency of 10 kHz, i.e., 2000 samples/cycle; thus the sample vector becomes an array containing 1500 data points for each fault type and for each phase. Hence, the three-phase training data matrix corresponding to each fault type, with the fault conducted at 70 km, has dimension 1500×3, i.e., 1500 rows and 3 columns, for one type of training fault. Twelve such prototypes are analyzed in sequence to form the complete PCDI matrix of dimension 12×3, denoted by P. Each row corresponds to one of the twelve cases (the fault cases and the test condition) and each column represents one of the three phases.

As mentioned before, PCA reconstructs a data set in descending order of importance. For ease of analysis, only the two most important directions (PCs) and the corresponding score data are considered for the present purpose and used to construct the PCDI matrix. These PCDI values are an approximate estimate of the deviation of each fault current from the healthy condition. The directions of variation are given by the eigenvectors obtained from the covariance matrix of the transformed data points or scores, and the magnitudes of deviation from the origin (the origin is assigned to the no-fault condition) are given by the corresponding eigenvalues.

PCA algorithm applied for the proposed work
Step 1: Assign input data: the input data are taken as the phase-identified matrices [Xa], [Xb], and [Xc]. Each phase is processed individually, with the joint matrix for phase k denoted by $J_k$, where k takes the indices a, b, and c.
Step 2: Compute the mean of each column of $J_k$ individually as $(\mu_k)_n = \frac{1}{1500}\sum_{i=1}^{1500} (J_k)_{i,n}$, where i indexes rows and takes the values 1 to 1500 and n indexes columns and takes the values 1 to 12.
Step 3: Subtract the mean of each column from every row of that column, independently for each column, to form the modified joint matrix: $J_{k,n}^{MOD} = J_{k,n} - (\mu_k)_n$. The dimension therefore remains 1500×12.
Step 4: Compute the covariance matrix: $\Sigma = \frac{1}{N} (J_k^{MOD})^T (J_k^{MOD})$.
Step 5: Calculate the eigenvalues and eigenvectors of $\Sigma$.
Step 6: Pick a few eigenvectors (d' < d) corresponding to the largest eigenvalues and place them in the columns of A in descending order of eigenvalue, i.e., $A = [V_1\ V_2\ \dots\ V_{d'}]$, where V1, V2 are the 1st, 2nd PCs and so on. For the proposed work, only the two eigenvectors with the largest eigenvalues, V1 and V2, are taken.
Step 7: Compute the new data matrix (PC scores) in reduced dimension: $J_k^{MOD}(new) = A^T J_k^{MOD}$. The dimension of the full score matrix $J_k^{MOD}(new)$ is 12×1500. Since the proposed work uses only the two most significant directions, the working $J_k^{MOD}(new)$ reduces to a 12×2 matrix, which acts as the PC score matrix for the proposed work.
Step 8: Form the PCDI matrix: the PCA distance is the vector distance of each training and test score (2D) from the no-fault score (2D), which is taken as the origin. This produces a 12×1 PCDI vector for each phase and, considering all three phases, a total PCDI matrix of dimension 12×3, denoted by S. The top eleven rows of S correspond to the eleven different training conditions, the twelfth row corresponds to the test condition, and each column represents one of the three phases.

The total PCDI matrix S is then segmented into two matrices: the training PCDI matrix (denoted by P), formed by the eleven training rows, and the test PCDI vector (denoted by Q), formed by the twelfth row. A similarity analysis is carried out to compare the experimental data (Q vector) with the training fault signatures (P matrix) for each individual phase and to find the maximum similarity with any of the eleven different patterns, thus classifying the fault.

It is observed that the PCDIs follow a certain pattern when computed for faults conducted at increasing geometric distance from the sending end, but the relative pattern among the three phases remains the same. For example, for a DL-AB fault, the PCDI magnitudes of phases A and B are very high in comparison to phase C, which remains almost zero because it is the undisturbed phase. This pattern remains almost the same even when the fault location changes. In addition, the rate of change in PCDI magnitude with increasing or decreasing fault distance is nearly identical for each phase. To establish this inference mathematically, the PCDI of each phase is divided by the PCDI of another phase; the result should remain almost the same regardless of the geometric distance of the fault, since all three phase PCDIs vary in almost the same ratio as the fault distance changes.
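Steps 1 to 8 above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `pcdi_for_phase` and `pcdi_matrix`, the synthetic sine data, and the choice to place the no-fault case in column 0 are assumptions made for illustration. The sketch also assumes that the two principal directions are the dominant temporal modes of the mean-removed joint matrix, so that each of the twelve cases receives one 2D score, which is one way to obtain the 12×2 score matrix described in Step 7.

```python
import numpy as np

def pcdi_for_phase(J, no_fault_col=0, n_components=2):
    """PCDI of every case for one phase.

    J is an (n_samples, n_cases) joint matrix, one waveform per column
    (1500 x 12 in the paper). Placing the no-fault case in column 0 is an
    assumption of this sketch.
    """
    J = np.asarray(J, dtype=float)
    # Steps 2-3: remove the mean of each column.
    J_mod = J - J.mean(axis=0, keepdims=True)
    # Steps 4-6: the left singular vectors of J_mod are the eigenvectors of
    # J_mod @ J_mod.T, ordered by decreasing eigenvalue; keep the first two.
    U, _, _ = np.linalg.svd(J_mod, full_matrices=False)
    A = U[:, :n_components]
    # Step 7: project every case onto the retained directions -> (n_cases, 2).
    scores = (A.T @ J_mod).T
    # Step 8: distance of every score from the no-fault score (the origin).
    return np.linalg.norm(scores - scores[no_fault_col], axis=1)

def pcdi_matrix(Ja, Jb, Jc, no_fault_col=0):
    """Stack the per-phase PCDI vectors into an (n_cases, 3) matrix."""
    return np.column_stack(
        [pcdi_for_phase(J, no_fault_col) for J in (Ja, Jb, Jc)]
    )

# Synthetic demo: 12 cases whose current amplitude grows with severity.
t = np.linspace(0.0, 0.015, 1500, endpoint=False)
amps = np.array([1.0] + [2.0 + 0.5 * i for i in range(11)])
J_demo = np.sin(2 * np.pi * 50 * t)[:, None] * amps[None, :]
S_demo = pcdi_matrix(J_demo, J_demo, J_demo)
```

In this toy data set the PCDI of a case grows with how far its waveform departs from the no-fault waveform, which is the property the classifier relies on.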
The 3D ratio matrix (R) is hence formed from the 3D PCDI vector obtained for each type of training fault and for the test data; its elements are the ratios between the PCDIs of the individual phases [22], where i follows the same indexing pattern. Note that for a no-fault condition the PCDIs of all the phases are zero and are assigned as the origin. Hence, the no-fault condition is identified by comparing the PCDI directly with a very low constant value, as described later, and the remaining cases are used for the ratio analysis.
[R] is further segmented into training and test matrices, [Ratio TRAINING] and [Ratio TEST]. The [Ratio TEST] vector will be similar to one of the ten fault prototypes defined by the ten rows of [Ratio TRAINING]. To model this inference mathematically, a 3D ratio error matrix (RE) is formed from [Ratio TRAINING] and [Ratio TEST]. Finally, a column vector of ratio error index (REI) is found by combining the ratio error values of each type, the elements of which are given as:

$(\mathrm{REI})_i = (\text{Ratio Error 1})_i + (\text{Ratio Error 2})_i + (\text{Ratio Error 3})_i$
Quite understandably, $(\mathrm{REI})_i$ will be minimum when the test pattern and the corresponding training pattern match, so this vector is used to classify the fault by identifying the index i with the minimum REI value. In addition, two threshold values, Ɛ1 (the lower threshold) and Ɛ2 (the upper threshold), are selected based on the test data set. The no-fault condition is detected by directly comparing the PCDI summation of the test data with the lower threshold:

$PCDI^{TEST}_{sum} = PCDI^{TEST}_{A} + PCDI^{TEST}_{B} + PCDI^{TEST}_{C}$

If $PCDI^{TEST}_{sum}$ is less than the lower threshold Ɛ1, the case is identified as no fault, owing to the absence of any major disturbance in any phase. Conversely, a fault is detected when $PCDI^{TEST}_{sum}$ is higher than Ɛ1. DL faults are similarly found by comparing the ratio error index with the upper threshold Ɛ2, followed by direct analysis of [PCDI] and [R]. The entire analysis is illustrated by the case study discussed in the next section.

It is further observed that for DL faults the directly unaffected phase remains almost undisturbed, whereas for DLG faults some disturbance occurs even in the directly unaffected line. This is due to the involvement of ground and the flow of zero sequence current through the ground and the grounded neutral of the transmission system, which makes the distinction between the two types very clear. This inference is also observed from the PCDI matrix discussed in the case study. Figure 3 elaborates the proposed algorithm in detail.
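The ratio-based decision described above can be sketched as follows. Because the defining equations for R and RE are not reproduced here, this sketch makes two explicit assumptions: the three ratios are taken as A/B, B/C, and C/A, and the ratio error is the absolute percentage deviation of the test ratio from the training ratio. The function names, the `floor` guard against division by zero, and the demo numbers are likewise hypothetical, and the DL branch that uses the upper threshold Ɛ2 is left out.

```python
import numpy as np

def ratio_index(pcdi, floor=1e-12):
    """Phase-to-phase PCDI ratios (assumed here to be A/B, B/C, C/A)."""
    pcdi = np.atleast_2d(np.asarray(pcdi, dtype=float))
    a, b, c = pcdi[:, 0], pcdi[:, 1], pcdi[:, 2]
    return np.column_stack([
        a / np.maximum(b, floor),
        b / np.maximum(c, floor),
        c / np.maximum(a, floor),
    ])

def classify_fault(P, labels, Q, eps1):
    """Return (label, REI vector) for the test PCDI vector Q.

    P      : (n_prototypes, 3) training PCDI matrix
    labels : one name per row of P
    eps1   : lower threshold on the summed test PCDI (no-fault check)
    """
    Q = np.asarray(Q, dtype=float)
    if Q.sum() < eps1:                      # no major disturbance in any phase
        return "no fault", None
    r_train = ratio_index(P)
    r_test = ratio_index(Q)
    # Assumed ratio error: absolute percentage deviation from the prototype.
    re = 100.0 * np.abs(r_train - r_test) / np.maximum(np.abs(r_train), 1e-12)
    rei = re.sum(axis=1)                    # REI_i = RE1_i + RE2_i + RE3_i
    return labels[int(np.argmin(rei))], rei

# Made-up prototypes: the faulted phase has a PCDI ten times the others.
labels_demo = ["AG", "BG", "CG"]
P_demo = np.array([[10.0, 1.0, 1.0],
                   [1.0, 10.0, 1.0],
                   [1.0, 1.0, 10.0]])
# A BG-like test fault at another location: smaller magnitudes, similar ratios.
label_demo, rei_demo = classify_fault(P_demo, labels_demo, [0.8, 8.5, 0.9], eps1=0.1)
quiet_demo, _ = classify_fault(P_demo, labels_demo, [1e-3, 1e-3, 1e-3], eps1=0.1)
```

The demo reflects the argument made above: the absolute PCDI values of the test case differ from the prototype, but the ratios stay close, so the REI is smallest for the matching fault type.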

RESULTS AND ANALYSIS
A sample data set for an arbitrary fault is taken here as a case study and processed through the proposed PCA algorithm to produce the PCDI matrix shown in the initial columns of Table 1, which gives a combined view of [PCDI], [R], and [RE]. The [PCDI] is further represented graphically as a three-dimensional plot in Figure 4. Close observation of Figure 4 reveals that the PCDI vector of the unknown type, i.e., legend 9, is closer to the SLG-BG fault, i.e., legend 3, than to any other type, with minimum Euclidean distance. This is further confirmed by forming [R], as shown in the middle columns of Table 1. [R] is again represented graphically in Figure 5.

Close observation of [PCDI] and [R] reveals a distinguishing feature for each particular type of fault: the test fault PCDI values closely resemble those of SLG-B, and this similarity is reinforced by the ratio values, marked in bold letters. The same is observed from Figure 5, where the Euclidean distance between legends 3 and 9 is much smaller than in Figure 4, thus confirming the test pattern to be SLG-B under the MSE criterion. Thus, the formation of R greatly emphasizes the similarity between the test data and one of the eleven sets of fault prototypes, and this is also tested with varying fault location.

It is further observed that, since the unaffected phase is least disturbed in a DL fault, as indicated by the corresponding PCDI values, one of the ratios becomes abruptly high for DL faults. Table 1 shows, for example, that PCDI-C for the DL-AB fault is very low because phase C is the unaffected phase; when it is used to form [R], Ratio 2 becomes abruptly high, and this is reflected in [RE] as well as in the ratio error index (REI) shown in the final column of Table 1.
It shows that the REI is far larger for DL faults than for all the other prototypes. This key feature is used to separate the DL faults from the rest, and the upper threshold value Ɛ2 is set by comparison with all other fault types. For the given set of PCDI, the ratio error index for faults other than DL faults is well below 100, and that for DL faults is far above it. Hence, Ɛ2 for this case can safely be set at 100. For the same reasons, DL faults are not included in Figure 5.

It is further observed that, even on varying the geometric fault distance from the sending end, the PCDI vary following a particular pattern, as described by the PCDI of Table 2, where typical fault data for an SLG-BG fault, for example, are taken at distances 10 km apart over the entire span of the 150 km long line. More importantly, their mutual ratios remain very similar even with varying geometric distance (km) over the entire span of the transmission line, as described by the three ratio index values of the same table. This fact is also represented in Figure 6, which is constructed using the three-phase [PCDI] and [R] values of Table 1 where, as described earlier, an SLG-B fault is taken as the example for different fault locations.

        PURE   AG   BG   CG   AB   BC   CA  ABG  BCG  CAG  ABC
PURE      13    0    0    0    0    0    0    0    0    0    0
AG         0   13    0    0    0    0    0    0    0    0    0
BG         0    0   13    0    0    0    0    0    0    0    0
CG         0    0    0    …

FAULT DISTANCE ESTIMATION
The latter, and another vital, part of the proposed research is prediction of the fault location. The proposed fault distance predictor algorithm is designed using multiple linear regression (MLR) analysis. MLR takes into account the trends and curvatures of more than one data set and effectively computes one primary direction of variation from the multiple data sets. The proposed work utilizes this feature of MLR and uses the three-phase features, in terms of PCDI, to form one key curvature that incorporates the features of all the PCDI.

For this purpose, six intermediate non-equidistant locations at 10, 20, 50, 90, 130, and 140 km from the sending end of the 150 km long line have been chosen as the six training points for the proposed fault localizer algorithm. Ten different types of faults have been conducted at these six training locations, and the receiving end current waveforms have been recorded as the training data. Each waveform is processed by the fault classifier algorithm discussed in the previous section, and the three-phase PCDI are found for each of the six training points. This 3D training data set for each fault prototype is saved as a look-up table and is scaled to unity for generalization and uniformity. Hence, the training data matrix for each fault pattern has dimension 6×3; it is called the training distance PCDI matrix hereafter and is denoted by Di, where i = 1 to 10 indexes the ten training fault prototypes mentioned before, and its rows j = 1 to 6 correspond to the six training geometric distances of 10, 20, 50, 90, 130, and 140 km respectively. Hence, for the ten types of faults, there are ten such training distance PCDI matrices, which together form the total training distance PCDI matrix DTRAINING. After classification of the fault, the test PCDI matrix Q found in the earlier section is saved.
Next, the Di matrix corresponding to the identified fault type with index i is taken from DTRAINING, followed by interpolation of the test Q vector from the corresponding Di using the multiple linear regression (MLR) method in order to predict the geometric distance of the fault.
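The look-up table of scaled training matrices can be sketched as below. The text does not state how the data are "scaled to unity"; this sketch assumes that each phase column of Di is divided by its own maximum, and that a test vector Q is divided by the same factors before it is compared with Di. The helper names and the numbers in `raw_demo` are invented for illustration and are not taken from Table 4.

```python
import numpy as np

# Training distances from the sending end, in km (from the text).
TRAIN_KM = np.array([10, 20, 50, 90, 130, 140], dtype=float)

def scale_to_unity(D):
    """Scale each phase column of a 6x3 PCDI matrix so its maximum is 1.

    Column-wise max scaling is an assumption of this sketch.
    Returns the scaled matrix and the per-phase scale factors.
    """
    D = np.asarray(D, dtype=float)
    factors = D.max(axis=0)
    return D / factors, factors

def build_lookup(raw_by_fault):
    """Map each fault type to its (scaled D_i, scale factors) pair."""
    return {name: scale_to_unity(D) for name, D in raw_by_fault.items()}

def scale_test_vector(Q, factors):
    """Apply the training scale factors to a test PCDI vector."""
    return np.asarray(Q, dtype=float) / factors

# Invented PCDI values for one fault type at the six training locations.
raw_demo = {
    "SLG-A (synthetic)": np.array([
        [9.0, 1.2, 1.1],
        [8.1, 1.1, 1.0],
        [6.0, 0.9, 0.8],
        [4.2, 0.6, 0.6],
        [2.9, 0.4, 0.5],
        [2.6, 0.4, 0.4],
    ])
}
lookup_demo = build_lookup(raw_demo)
D_scaled_demo, factors_demo = lookup_demo["SLG-A (synthetic)"]
Q_scaled_demo = scale_test_vector([4.5, 0.6, 0.55], factors_demo)
```

Keeping the scale factors alongside each Di is a design choice of the sketch: it ensures that training and test PCDI are expressed on the same scale when the regression is applied.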

CASE STUDY AND ANALYSIS
A case study is shown here for an SLG-A fault. The variation of the receiving end line currents with varying geometric fault distance for the SLG-A fault is shown in Figure 7. The same data are processed through the PCA algorithm to produce [PCDI] and the subsequent calculations. Table 4 lists the absolute PCDI values and the corresponding scaled values for the SLG-A fault at the six training locations. The D SLG-A matrix is formed from the PCDI values recorded in columns 2, 3, and 4 of Table 4.
Similarly, the D scaled SLG-A matrix is formed from the values in columns 5, 6, and 7, which, when plotted against the respective geometric fault locations, reveal a curvilinear nature, as shown in Figure 8. It is observed that each fault type shows a different curvature for the three individual phases. Hence, the proposed scheme has been designed with multiple linear regression (MLR) for each prototype individually, which takes into account all three phase PCDIs to produce a fairly accurate estimate of the fault location. The mathematical analysis of the MLR scheme adopted here is explained first, followed by its application in designing the fault location prediction algorithm [23].

Figure 8. Geometric fault distance vs. PCDI (scaled) plot for three phase receiving end line currents for SLG-A fault at six training locations

APPLICATION OF MULTIPLE LINEAR REGRESSION (MLR)
Principal component analysis (PCA), as explained so far, is itself an important and effective tool for reducing a large volume of multivariate data to a few primary directions of major variation. The PCDI of the three phases differ in curvature, as is observed from Figure 10, and the same holds for all ten fault patterns. The three-phase PCDI for each pattern is processed by the proposed MLR-based scheme to obtain a single computed direction of variation that takes into account all three curvatures from the three phases; this is finally taken as the training data for the proposed fault distance predictor algorithm.

Regression analysis is an important statistical tool for determining the relationship, called the regression function, between a dependent variable 'y' and one or several independent variables 'xi'. The regression function also involves a set of unknown parameters 'bi', called the regression coefficients. A simple linear regression model is described as:

y=b0+b1x (1)

Linear regression models with multiple independent variables are referred to as multiple linear models, and such a model is given as:

y=b0+b1x1+b2x2+b3x3+…+bnxn (2)

where n is the total number of independent variables.
The proposed algorithm uses the scaled PCDI of the three phases as the input variables xi. Figure 10 reveals that the three-phase scaled PCDI show a curvilinear nature rather than a straight-line trend. Hence, to take this curvature into account, the proposed algorithm is extended to multiple orders of these primary inputs xi, selected according to the minimum square error (MSE) criterion. The number of independent variables has likewise been chosen according to the MSE criterion; moreover, the number of such variables varies from one fault type to another, as do the order, the type, and the interdependence, if any, among the variables.

The proposed scheme is constructed on this basis, with the scaled PCDI of the three phases as the primary input variables. The idea is to train the proposed MLR-based fault localization algorithm with the best-fit arrangement, taking all three phases together, although the maximum variation occurs in the directly affected line. The maximum order of X1, i.e., the index k, has been assumed to be 12, i.e., twice the number of training locations, only to reduce computational complexity, and the intermediate orders, i.e., the values of index k, are set according to the MSE criterion. Thus, the complete input matrix, given by (8), is a 3k×1 vector, and the coefficients of B are defined correspondingly in (9). The maximum number of input variables and each particular order differ for each of the ten training patterns and have been chosen according to the MSE criterion, producing a different coefficient matrix for each training pattern. In short, the non-linear nature of the PCDI has been captured using MLR analysis.
Equations (4) to (9) describe the MLR analysis for each fault pattern only; the complete equation can be written in matrix form, as obtained from (4), in (10). In order to estimate the regression coefficients, a least squares approach has been adopted; thus, the algorithm minimizes each of the errors described as:

εi=∑(yi-b1xi1-b2xi2-b3xi3-…-b3nxik) (11)

which is found with all possible training values. The error is minimized by setting its derivative with respect to each regression coefficient to zero, which yields the coefficient estimate (13), and the residuals are the differences between the observed and the fitted values. These residuals have been minimized following the MSE criterion, and the corresponding orders of the polynomials for each type of training set have been obtained and stored in a look-up table. Thus, each training pattern has a different B vector, differing both in magnitude and in dimensionality.

In order to test any unknown fault current, the proposed fault classifier algorithm based on the ratio analysis is applied first to identify the exact type of fault, followed by fault distance prediction using MLR as described. The three-phase PC indices corresponding to the experimental waveform are analyzed with the same location prediction algorithm, using the regression coefficient matrix (B) corresponding to the fault type determined by the classifier, and the predicted location is derived. The proposed algorithm is described in Figure 9 in the form of a flowchart.
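A minimal least squares sketch of the regression step is given below. It is an illustration under stated assumptions rather than the authors' procedure: the design matrix holds an intercept followed by powers 1 to `order` of each scaled phase PCDI, the coefficients are obtained with `numpy.linalg.lstsq`, and the per-fault-type selection of orders by the MSE criterion is not reproduced. The function names and the synthetic PCDI curves are hypothetical.

```python
import numpy as np

def design_matrix(X, order):
    """Columns: 1, then x1^p, x2^p, x3^p for p = 1..order."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    cols = [np.ones(len(X))]
    for p in range(1, order + 1):
        cols.extend(X[:, j] ** p for j in range(X.shape[1]))
    return np.column_stack(cols)

def fit_mlr(X_train, y_train, order):
    """Least squares estimate of B and the training mean square error."""
    y = np.asarray(y_train, dtype=float)
    Phi = design_matrix(X_train, order)
    B, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    residuals = y - Phi @ B
    return B, float(np.mean(residuals ** 2))

def predict_distance(B, q, order):
    """Predicted fault distance for one three-phase PCDI vector q."""
    return float((design_matrix(q, order) @ B)[0])

# Six training locations from the text, in km.
d_train = np.array([10, 20, 50, 90, 130, 140], dtype=float)

def synthetic_pcdi(d):
    """Invented three-phase PCDI curves, each with a different curvature."""
    a = 1.0 - 0.8 * np.asarray(d, dtype=float) / 150.0
    return np.column_stack([a, a ** 2, np.sqrt(a)])

B_demo, mse_demo = fit_mlr(synthetic_pcdi(d_train), d_train, order=1)
d_hat_demo = predict_distance(B_demo, synthetic_pcdi([70.0]), order=1)
```

In this synthetic case the distance is an exact linear function of the first input, so the fit recovers the 70 km test location; with measured PCDI, higher orders would be needed to follow the curvature, which is why the order is chosen per fault type.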

RESULTS OF FAULT LOCATIONS PREDICTOR
Table 5 shows a summary of the results of the proposed fault location predictor algorithm for ten different types of faults occurring at different locations. The proposed scheme produces an average deviation of 0.0871 km for the 150 km long transmission line, which is well within a satisfactory margin.
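To put this figure in perspective, the average deviation can be expressed in metres and as a fraction of the line length. The short calculation below uses only the two numbers quoted above; the percentage is derived here and is not a value reported in Table 5.

```latex
\begin{align*}
0.0871~\text{km} \times 1000~\tfrac{\text{m}}{\text{km}} &= 87.1~\text{m} \\
\frac{0.0871~\text{km}}{150~\text{km}} \times 100\% &\approx 0.058\%
\end{align*}
```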

CONCLUSION
A simple and effective power system protection scheme for fault classification and fault distance prediction on a long transmission line has been proposed here for a single-end-fed, 400 kV, 50 Hz, 150 km long radial transmission line. Principal component analysis and multiple linear regression analysis have been adopted to design and implement the proposed protection scheme. Quarter-cycle pre-fault and half-cycle post-fault receiving end three-phase fault current waveforms are the only input to the algorithm. The PCA scores computed from the input data are used to construct principal component distance indices (PCDI), which are used to develop a ratio-based algorithm to identify and classify faults.
Results show that the classifier achieves 100% accuracy using only one set of training data taken almost at the midpoint of the line.
Thus, the small amount of training data required is one of the key features of the proposed fault classifier. The scheme uses only PCA-based analysis instead of ANN or wavelet transform based approaches. ANN requires a large amount of training data, and hence its training time is also very high. Wavelet analysis, on the other hand, is computationally heavy. Most of the other methods also involve more complex analysis, which requires a longer computation time. Simplicity compared to some other existing methods and a shorter computation time are further key features of the scheme. The proposed protection scheme is also extended to develop a fault localizer algorithm. The average deviation of the predicted fault location is only about 87.1 m. Hence, the proposed algorithm has high accuracy in determining power system fault locations as well. Accurate fault localization helps personnel to find the fault point quickly and saves valuable time and effort in restoring normal operation. Thus, the proposed protection scheme has the qualities needed for the development of a reliable transient-based power system protection unit.