
Comparative Analysis of Robust Modification Quality for Principal Component Analysis to Perform Correlated Data Compression

Authors: Goryainov V.B., Goryainova E.R. Published: 23.06.2021
Published in issue: #3(96)/2021  
DOI: 10.18698/1812-3368-2021-3-23-45

Category: Mathematics and Mechanics | Chapter: Computational Mathematics  
Keywords: robust principal component analysis, MCD estimate, Gnanadesikan --- Kettenring estimate, Olive --- Hawkins estimate

Principal component analysis is one of the methods traditionally used to reduce the dimensionality of a multidimensional vector with correlated components. The principal components are constructed from a spectral representation of the covariance or correlation matrix of the observed indicators. Classical principal component analysis uses Pearson sample correlation coefficients as estimates of the elements of the correlation matrix. These estimates are extremely sensitive to sample contamination and anomalous observations. To robustify principal component analysis, we propose replacing the sample estimates of the correlation matrices with well-known robust analogues: Spearman's rank correlation coefficient, the Minimum Covariance Determinant (MCD) estimate, the orthogonalized Gnanadesikan --- Kettenring estimate, and the Olive --- Hawkins estimate. The study aims to carry out a comparative numerical analysis of classical principal component analysis and its robust modifications. For this purpose, we simulated nine-dimensional vectors with known correlation matrix structures and introduced a special metric for evaluating the quality of data compression. An extensive numerical experiment showed that classical principal component analysis achieves the best compression quality for a Gaussian distribution of observations. When the observations follow a Student's t-distribution with three degrees of freedom, or when the data contain a cluster of outliers, individual anomalous observations, or symmetric contamination described by the Tukey distribution, the modifications based on the Gnanadesikan --- Kettenring and Olive --- Hawkins estimates show the best compression quality, whereas the quality of classical principal component analysis and of its Spearman rank modification degrades in these cases.
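The substitution scheme described in the abstract (estimate the correlation matrix robustly, then proceed with the usual eigendecomposition) can be sketched as follows. This is a minimal illustration using Spearman's rank correlation, the simplest of the robust analogues listed above; the function names, the choice of a rank-based estimate, and the synthetic contaminated data are ours, not the authors'.

```python
import numpy as np

def spearman_corr(X):
    # Rank each column (ties are unlikely for continuous data),
    # then take the Pearson correlation of the ranks: this is
    # Spearman's rank correlation matrix.
    ranks = np.argsort(np.argsort(X, axis=0), axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def robust_pca_spearman(X, k):
    # Eigendecompose the robust correlation matrix and keep the
    # k leading principal directions.
    R = spearman_corr(X)
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]      # reorder to descending
    components = eigvecs[:, order[:k]]
    # Project the standardized data onto the principal directions.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z @ components, eigvals[order]

rng = np.random.default_rng(0)
# Three strongly correlated columns plus a few gross outliers.
base = rng.standard_normal((200, 1))
X = np.hstack([base + 0.1 * rng.standard_normal((200, 1)) for _ in range(3)])
X[:5] += 50 * rng.standard_normal((5, 3))  # contaminate 5 of 200 rows

scores, eigvals = robust_pca_spearman(X, k=1)
print(eigvals[0] / eigvals.sum())  # share of total variance kept by PC1
```

Because the rank transform bounds the influence of the contaminated rows, the leading eigenvalue still captures most of the total variance, whereas a Pearson-based correlation matrix would be distorted by the outliers.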

References

[1] Hubert M., Engelen S. Robust PCA and classification in biosciences. Bioinformatics, 2004, vol. 20, iss. 11, pp. 1728--1736. DOI: https://doi.org/10.1093/bioinformatics/bth158

[2] Hubert M., Rousseeuw P.J., Branden K.V. ROBPCA: a new approach to robust principal component analysis. Technometrics, 2005, vol. 47, iss. 1, pp. 64--79. DOI: https://doi.org/10.1198/004017004000000563

[3] Goryainova E.R., Shalimova Yu.A. Reducing the dimensionality of multivariate indicators containing non-linearly dependent components. Business Informatics, 2015, no. 3, pp. 24--33 (in Russ.).

[4] Wright J., Peng Y., Ma Y., et al. Robust principal component analysis: exact recovery of corrupted low-rank matrices by convex optimization. 22nd NIPS. ACM, 2009, pp. 2080--2088.

[5] Wilcox R.R. Robust principal components: a generalized variance perspective. Behav. Res., 2008, vol. 40, no. 1, pp. 102--108. DOI: https://doi.org/10.3758/BRM.40.1.102

[6] Maronna R. Principal components and orthogonal regression based on robust scales. Technometrics, 2005, vol. 47, no. 3, pp. 264--273.

[7] Croux C., Haesbroeck G. Principal component analysis based on robust estimators of the covariance or correlation matrix: influence functions and efficiencies. Biometrika, 2000, vol. 87, iss. 3, pp. 603--618. DOI: https://doi.org/10.1093/biomet/87.3.603

[8] Spearman C. The proof and measurement of association between two things. Am. J. Psych., 1904, vol. 15, no. 1, pp. 72--101. DOI: https://doi.org/10.2307/1412159

[9] Rousseeuw P.J., Leroy A.M. Robust regression and outlier detection. Wiley, 1987.

[10] Gnanadesikan R., Kettenring J.R. Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics, 1972, vol. 28, no. 1, Special Multivariate Issue, pp. 81--124. DOI: https://doi.org/10.2307/2528963

[11] Maronna R., Zamar R.H. Robust estimates of location and dispersion for high-dimensional datasets. Technometrics, 2002, vol. 44, iss. 4, pp. 307--317. DOI: https://doi.org/10.1198/004017002188618509

[12] Olive D.J. Robust multivariate analysis. Cham, Springer, 2017. DOI: https://doi.org/10.1007/978-3-319-68253-2

[13] Zhang J., Olive D.J., Ye P. Robust covariance matrix estimation with canonical correlation analysis. Int. J. Stat. Probab., 2012, vol. 1, no. 2, pp. 119--136. DOI: https://doi.org/10.5539/ijsp.v1n2p119

[14] Croux C., Garcia-Escudero L.A., Gordaliza A., et al. Robust principal component analysis based on trimming around affine subspaces. Stat. Sin., 2017, vol. 27, no. 3, pp. 1437--1459.

[15] Ivchenko G.I., Medvedev Yu.I. Vvedenie v matematicheskuyu statistiku [Introduction to mathematical statistics]. Moscow, LKI Publ., 2010.

[16] Aivazyan S.A., ed. Prikladnaya statistika. Klassifikatsia i snizheniye razmernosti [Applied statistics. Classification and dimension reduction]. Moscow, Finansy i statistika Publ., 1989.

[17] Jolliffe I.T. Principal component analysis. Springer Series in Statistics. New York, Springer-Verlag, 2002. DOI: https://doi.org/10.1007/b98835

[18] Devlin S.J., Gnanadesikan R., Kettenring J.R. Robust estimation of dispersion matrices and principal components. J. Am. Stat. Assoc., 1981, vol. 76, no. 374, pp. 354--362.

[19] Polyak B.T., Khlebnikov M.V. Principal component analysis: robust versions. Autom. Remote Control, 2017, vol. 78, no. 3, pp. 490--506. DOI: https://doi.org/10.1134/S0005117917030092

[20] Goryainova E.R., Pankov A.P., Platonov E.N. Prikladnye metody analiza statisticheskikh dannykh [Applied methods of statistical data analysis]. Moscow, HSE Univ. Publ., 2012.

[21] Abdullah M.B. On a robust correlation coefficient. J. R. Stat. Soc. Ser. D, 1990, vol. 39, no. 4, pp. 455--460. DOI: https://doi.org/10.2307/2349088

[22] Cator E.A., Lopuhaa H.P. Asymptotic expansion of the minimum covariance determinant estimators. J. Multivar. Anal., 2010, vol. 101, iss. 10, pp. 2372--2388. DOI: https://doi.org/10.1016/j.jmva.2010.06.009

[23] Rousseeuw P.J., van Driessen K. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 1999, vol. 41, iss. 3, pp. 212--223. DOI: https://doi.org/10.2307/1270566

[24] Maronna R.A., Martin D., Yohai V. Robust statistics theory and methods. Wiley, 2006.

[25] Huber P.J. Robust statistics. Wiley, 1981.

[26] Olive D.J. A resistant estimator of multivariate location and dispersion. Comput. Stat. Data Anal., 2004, vol. 46, iss. 1, pp. 93--102. DOI: https://doi.org/10.1016/S0167-9473(03)00119-1

[27] Lopuhaa H.P. Asymptotics of reweighted estimators of multivariate location and scatter. Ann. Stat., 1999, vol. 27, iss. 5, pp. 1638--1665. DOI: https://doi.org/10.1214/aos/1017939145