Study of the Principal Components Method Modifications Resistance to Abnormal Observations
Authors: Goryainov V.B., Goryainova E.R.
Published: 22.05.2023
Published in issue: #2(107)/2023
DOI: 10.18698/1812-3368-2023-2-17-34
Category: Mathematics and Mechanics | Chapter: Computational Mathematics
Keywords: principal components method, robust estimation of the correlation matrix, MCD estimate, Gnanadesikan-Kettenring estimate, Olive-Hawkins estimate, Tukey distribution, bimodal distribution
Abstract
The paper addresses the problem of reducing the dimension of multidimensional correlated indicators. One approach to this problem is the principal components method, which compactly describes a vector with correlated coordinates (components) by a vector of uncorrelated principal components of much smaller dimension while retaining most of the information about the correlation structure of the original vector. Several modifications of the principal components method, differing in how the correlation matrix of the observation vector is estimated, were compared on simulated and real data. The objective of the work is to demonstrate the advantages of the robust modifications of the principal components method when the data contain abnormal values. To compare the modifications on simulated data, a metric was introduced that measures the difference between the estimated and true eigenvalues of the correlation matrix of the initial data. The behavior of this metric as a function of the probability distribution of the observations was studied by computer simulation. The distributions selected were multivariate distributions with non-diagonal correlation matrices that simulate a contaminated sample. Next, a sample of 13 correlated socioeconomic indicators for 85 countries was considered, in which 46 abnormal values were identified. All the considered modifications selected the same optimal number of principal components, namely three. However, the compression quality on the real data, defined as the share of the total variance of the initial indicators explained by the first three principal components, turned out to be significantly higher for the robust modifications. The results obtained on these real data agree well with the conclusions of the computer simulation.
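The comparison described in the abstract can be illustrated on a toy bivariate case. The sketch below is not the authors' code or experimental design. It assumes a true correlation of 0.8, a 10% contamination by independent wide-scale noise, the median absolute deviation (MAD) as the robust scale inside a Gnanadesikan-Kettenring pairwise estimate, and a Euclidean distance between sorted eigenvalue vectors as the error metric. All of these choices, and all function names, are hypothetical and chosen only for illustration.

```python
import math
import random
from statistics import median

RHO = 0.8          # assumed true correlation of the clean data
N = 500            # assumed sample size
EPS = 0.10         # assumed share of abnormal observations
random.seed(2023)  # fixed seed so the run is reproducible


def mad_scale(x):
    """MAD scaled by 1.4826 so that it estimates sigma for normal data."""
    m = median(x)
    return 1.4826 * median([abs(v - m) for v in x])


def pearson(x, y):
    """Classical sample correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gk_correlation(x, y):
    """Gnanadesikan-Kettenring pairwise estimate, with MAD as the scale."""
    mx, my = median(x), median(y)
    sx, sy = mad_scale(x), mad_scale(y)
    u = [(a - mx) / sx for a in x]
    v = [(b - my) / sy for b in y]
    s_plus = mad_scale([a + b for a, b in zip(u, v)]) ** 2
    s_minus = mad_scale([a - b for a, b in zip(u, v)]) ** 2
    return (s_plus - s_minus) / (s_plus + s_minus)


def eigenvalues_2x2(r):
    """Eigenvalues of the correlation matrix [[1, r], [r, 1]], descending."""
    return [1 + abs(r), 1 - abs(r)]


def eigen_error(r_hat, r_true):
    """Assumed metric: Euclidean distance between sorted eigenvalue vectors."""
    est, true = eigenvalues_2x2(r_hat), eigenvalues_2x2(r_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(est, true)))


# Contaminated sample: clean correlated pairs mixed with wide independent noise.
xs, ys = [], []
for _ in range(N):
    if random.random() < EPS:
        xs.append(random.gauss(0, 5))
        ys.append(random.gauss(0, 5))
    else:
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        xs.append(z1)
        ys.append(RHO * z1 + math.sqrt(1 - RHO ** 2) * z2)

r_classical = pearson(xs, ys)
r_gk = gk_correlation(xs, ys)
err_classical = eigen_error(r_classical, RHO)
err_gk = eigen_error(r_gk, RHO)

print(f"classical: r = {r_classical:.3f}, eigenvalue error = {err_classical:.3f}")
print(f"GK (MAD):  r = {r_gk:.3f}, eigenvalue error = {err_gk:.3f}")
```

In two dimensions the share of total variance explained by the first principal component is the largest eigenvalue divided by 2, so an underestimated correlation directly lowers the apparent compression quality. For more than two indicators, a matrix built from pairwise estimates of this kind is not guaranteed to be positive definite and generally needs an additional correction step.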
Please cite this article in English as:
Goryainov V.B., Goryainova E.R. Study of the principal components method modifications resistance to abnormal observations. Herald of the Bauman Moscow State Technical University, Series Natural Sciences, 2023, no. 2 (107), pp. 17-34 (in Russ.). DOI: https://doi.org/10.18698/1812-3368-2023-2-17-34