If you want to get an idea about the concept and mathematics of PCA, you can go through this article. To understand the implementation of PCA from scratch in Python, and the signs of principal components, I recommend reading this article.
In this article, we are going to implement PCA using the scikit-learn machine learning library. Let's start by creating the data sample (I'm going to use the same data sample that was used in this article). Note that you have to create the data matrix such that the features are in columns and the samples are in rows.
import numpy as np

X = np.array([[2, 1, 1],
              [3, 4, 1],
              [5, 3, 4],
              [2, 4, 2]])
After that, we need to import the PCA class from scikit-learn.
from sklearn.decomposition import PCA
Then we have to create a PCA model, specifying either the amount of variance we need to retain or the number of principal components we need.
If we want to define the amount of variance to retain:
# It will retain 0.9 (90%) of the variance of the data points
pca = PCA(0.9)
Or if we want to define the number of principal components:
# It will select 2 principal components only
pca = PCA(n_components=2)
Let's now fit our data points to the model. When passing your data matrix to the pca.fit() method, you have to make sure that the features are in columns and the samples are in rows.
After fitting, we have to transform the data points.
Z = pca.transform(X)
We can also fit and transform the data points with a single line of code.
Z = pca.fit_transform(X)
Let's see our code together.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2, 1, 1],
              [3, 4, 1],
              [5, 3, 4],
              [2, 4, 2]])

pca = PCA(0.9)
pca.fit(X)
Z = pca.transform(X)
These are the new data points we obtain, stored in Z. Here Z is a 4×2 matrix: each row represents a sample and each column represents a new feature (a principal component).
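To see how the variance-retention argument works in practice, here is a small sketch that runs PCA(0.9) on the same data matrix and inspects the number of components scikit-learn actually selected, along with the variance explained by each one:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2, 1, 1],
              [3, 4, 1],
              [5, 3, 4],
              [2, 4, 2]])

# Ask PCA to keep enough components to explain at least 90% of the variance
pca = PCA(0.9)
Z = pca.fit_transform(X)

print(Z.shape)                        # (4, 2): 4 samples, reduced to 2 features
print(pca.n_components_)              # 2: components chosen to reach 90% variance
print(pca.explained_variance_ratio_)  # fraction of variance explained per component
```

On this data, the first two principal components together explain more than 90% of the variance, so scikit-learn keeps exactly two of them, which is why Z comes out as a 4×2 matrix.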
Now let's move on to applying PCA to a larger dataset. For that, we are going to use the IRIS dataset, which contains 150 samples from 3 classes (Setosa, Versicolor and Virginica), where each sample has 4 features (Sepal length, Sepal width, Petal length and Petal width).
To import the IRIS dataset:
from sklearn import datasets

IRIS = datasets.load_iris()
Here IRIS is a scikit-learn Bunch object. You can explore the IRIS dataset using the following commands.
# To get the iris data
# Each row represents a sample and each column represents a feature
IRIS.data

# To get the names of the 4 features
IRIS.feature_names

# To get the target value of each sample
IRIS.target

# To get the names of the 3 classes
IRIS.target_names
Now let's store the data values and the corresponding target values separately.
X = IRIS.data
Y = IRIS.target
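As a quick sanity check, you can confirm the shapes of X and Y match the description above (150 samples, 4 features, one class label per sample):

```python
from sklearn import datasets

IRIS = datasets.load_iris()
X = IRIS.data
Y = IRIS.target

print(X.shape)  # (150, 4): 150 samples, each with 4 features
print(Y.shape)  # (150,): one class label per sample
```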
Whenever we apply an algorithm to a dataset, we have to split the dataset into train and test data in order to evaluate the algorithm's performance.
from sklearn.model_selection import train_test_split

# Splitting 70% of the data into train data and 30% into test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
When applying PCA, you have to make sure that the pca.fit() method, which learns the model parameters from the train data, is applied to the train data only, because the test data should remain unseen by the model. After calling pca.fit() on the train data, we can apply the pca.transform() method to both the train data and the test data.
from sklearn.decomposition import PCA

# To retain 95% of the variance
pca = PCA(0.95)
pca.fit(X_train)
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)
Let's see our code together.
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

IRIS = datasets.load_iris()
X = IRIS.data
Y = IRIS.target

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

pca = PCA(0.95)
pca.fit(X_train)
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)
If you check the shapes of Z_train and Z_test, they will be (105, 2) and (45, 2) respectively: the 4 features of the samples have been reduced to 2 features. After that, you can classify the reduced data using a classifier.
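To close the loop, here is a minimal sketch of classifying the PCA-reduced IRIS data. The choice of classifier is up to you; logistic regression is used here purely as an example, and the fixed random_state is an assumption added so the split is reproducible:

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

IRIS = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    IRIS.data, IRIS.target, test_size=0.3, random_state=42)

# Fit PCA on the train data only, then transform both splits
pca = PCA(0.95)
Z_train = pca.fit_transform(X_train)
Z_test = pca.transform(X_test)

# Train a classifier on the reduced features (logistic regression as an example)
clf = LogisticRegression(max_iter=200)
clf.fit(Z_train, Y_train)

print(clf.score(Z_test, Y_test))  # classification accuracy on the test data
```

Even after dropping from 4 features to 2, the classifier should still separate the three iris classes well, since the retained components capture 95% of the variance.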