PCA from Scratch (Python) & Signs of Principal Components

Before going into the implementation of PCA, if you are not clear about the concept of PCA or the mathematics behind it, you can read this article. Even if you have no doubts, I still encourage you to have a look at the data sample and the data points derived from it using PCA, as I am going to apply PCA to the same data sample here.

We are going to apply PCA to the data sample that was used previously. To create that data matrix, we first have to import the following library:

import numpy as np

While creating the data matrix with the numpy library, you have to make sure that your features are in rows and your samples are in columns.

# 3 features (rows) * 4 samples (columns)
X = np.array([[2, 3, 5, 2], [1, 4, 3, 4], [1, 1, 4, 2]])

After that, we have to find the mean of each feature (i.e., each row).

# axis=1 gives the mean of each row, i.e., of each feature
mu = np.mean(X, axis=1)
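As a quick sanity check, the feature means for our X work out to 3, 3 and 2:

print(mu)   # [3. 3. 2.]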

Now we have to subtract the mean value of each feature from the corresponding feature of each sample.

# Build the mean matrix of size (3*4): every column equals mu
mean = np.array([mu,] * 4).transpose()
A = X - mean
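As a side note, the same centering can be done with NumPy broadcasting, without building the mean matrix explicitly (an equivalent sketch using the X and mu defined above):

# Reshape mu into a (3*1) column; broadcasting expands it across all 4 columns
A = X - mu[:, np.newaxis]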

Next, we find the covariance matrix C of matrix A:

C = np.cov(A)
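Note that np.cov treats each row as a variable and divides by N - 1 by default, which matches the usual sample-covariance formula. We can verify it against a manual computation:

# Sample covariance: A times A-transpose, divided by N - 1 (here N = 4 samples)
C_manual = np.dot(A, A.T) / (A.shape[1] - 1)
print(np.allclose(C, C_manual))   # True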

Let's find the eigenvalues and the corresponding normalized eigenvectors of the covariance matrix C.

eig_vals, eig_vecs = np.linalg.eig(C)
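Since a covariance matrix is always symmetric, np.linalg.eigh is a natural alternative here: it guarantees real eigenvalues (returned in ascending order, so the sorting step below is still needed) instead of the possibly complex dtype that np.linalg.eig can return. A minimal sketch:

# eigh is specialized for symmetric matrices
eig_vals, eig_vecs = np.linalg.eigh(C)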

We now have to sort the eigenvalues, and their corresponding eigenvectors, in order of decreasing eigenvalue.

# Indices that sort the eigenvalues in descending order
index = eig_vals.argsort()[::-1]
eigen_values = eig_vals[index]
# Reorder the eigenvector columns to match
eigen_vectors = eig_vecs[:, index]

Following are the eigenvalues and eigenvectors obtained from the code:
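Since the original output figure is not reproduced here, a quick print of the results gives approximately the following (computed for this particular X; the eigenvector signs can vary between NumPy/LAPACK builds):

print(eigen_values)
# approx. [3.7908 1.8759 0.3333]
print(eigen_vectors)
# columns approx. +/-(0.684, 0.255, 0.684), +/-(0.180, -0.967, 0.180), +/-(0.707, 0, -0.707)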

The eigenvalues and eigenvectors obtained by manual calculation in the previous article match these values up to rounding, apart from some signs.

(You may now be wondering about those sign differences between the eigenvectors obtained here and the ones obtained by manual calculation. Your question will be answered at the end of the article.)

After that, we have to select the number of principal components. I am going to select the number of principal components such that they retain more than 0.9 (90%) of the variance of the data (you can select whatever amount of variance you want to retain).

# Keep adding components until more than 90% of the variance is retained
k = 0
sum_of_k_ev = 0
while (sum_of_k_ev / eigen_values.sum()) <= 0.9:
    sum_of_k_ev = sum_of_k_ev + eigen_values[k]
    k = k + 1
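The same selection can be written without a loop using a cumulative sum (an equivalent NumPy sketch, assuming eigen_values is sorted in descending order as above):

# Fraction of the total variance retained by the first k components
explained = np.cumsum(eigen_values) / eigen_values.sum()
k = int(np.searchsorted(explained, 0.9)) + 1

For this data sample, k works out to 2: the first two components retain about 94.4% of the variance.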

We then keep the first k eigenvectors as our principal components:

V = eigen_vectors[:, 0:k]

Finally, we have to project our data onto the selected principal components.

Z = np.dot(V.T, A)
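If you ever need to go back from the projected points, the mapping can be (approximately) inverted; the reconstruction is exact only when all components are kept. A minimal sketch using the variables defined above:

# Map the k-dimensional points back into the original 3-D feature space
X_approx = np.dot(V, Z) + mean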

Let's see all our code together.

import numpy as np

# 3 features (rows) * 4 samples (columns)
X = np.array([[2, 3, 5, 2], [1, 4, 3, 4], [1, 1, 4, 2]])

# Mean of each feature (row)
mu = np.mean(X, axis=1)

# Center the data: subtract each feature's mean from every sample
mean = np.array([mu,] * 4).transpose()
A = X - mean

# Covariance matrix (rows are variables, divisor N - 1)
C = np.cov(A)

# Eigenvalues and normalized eigenvectors of C
eig_vals, eig_vecs = np.linalg.eig(C)

# Sort in order of decreasing eigenvalue
index = eig_vals.argsort()[::-1]
eigen_values = eig_vals[index]
eigen_vectors = eig_vecs[:, index]

# Keep adding components until more than 90% of the variance is retained
k = 0
sum_of_k_ev = 0
while (sum_of_k_ev / eigen_values.sum()) <= 0.9:
    sum_of_k_ev = sum_of_k_ev + eigen_values[k]
    k = k + 1

# Select the first k eigenvectors and project the data onto them
V = eigen_vectors[:, 0:k]
Z = np.dot(V.T, A)

Here the new data matrix Z (2*4) we obtain is:
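For this data sample, the projection works out to approximately the following (each row's overall sign may be flipped, as discussed next):

print(Z)
# approx. [[-1.877 -0.429  2.735 -0.429]
#          [ 1.574 -1.147  0.720 -1.147]]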

In this Z matrix, you may notice slight changes in the second decimal place of some values compared with the Z matrix we obtained by calculating manually (you can see that Z matrix in the previous article about PCA). This is because, in the manual calculation, we rounded the values of the normalized eigenvectors to two decimal places.

You may also wonder about the sign (+/-) differences in the first feature of all the samples. They just indicate a change in the direction of the first principal component. For a better understanding, let's plot the principal components using the eigenvectors obtained from the code and from the manual calculation. (The original figures were plotted using this website.)
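If you want to reproduce such a plot yourself, here is a minimal matplotlib sketch (matplotlib is my assumption here; the original figures were made with the website mentioned above) that scatters the projected samples on the PC1-PC2 plane:

import matplotlib.pyplot as plt

# Scatter the 4 samples in the plane spanned by the first two principal components
plt.scatter(Z[0], Z[1])
plt.axhline(0, linewidth=0.5)
plt.axvline(0, linewidth=0.5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Samples projected onto the first two principal components")
plt.show()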

From the figures above, you can clearly see the change in the direction of the first principal component. Because of that, if we plot both sets of data points on the principal component axes, we can easily notice that the positive and negative sides of the first principal component axis are switched.

That change in direction is exactly what the sign differences between the data points obtained from the code and from the manual calculation indicate.

In the next article, we will see how to implement PCA using a machine learning library and how to apply PCA to large datasets.
