Implementation of PCA using ML library (Python)
If you want to get an idea of the concept and the mathematics behind PCA, you can go through this article. To understand the implementation of PCA from scratch (in Python) and the signs of principal components, I recommend reading this article.
In this article, we are going to implement PCA using a machine learning library (scikit-learn). Now let's create the sample data (I'm going to use the data sample that was used in this article). Note that you have to create the data matrix such that the features are in columns and the samples are in rows.
import numpy as np
X = np.array([[2, 1, 1], [3, 4, 1], [5, 3, 4], [2, 4, 2]])
After that, we need to import the PCA class from scikit-learn.
from sklearn.decomposition import PCA
Then we have to create a PCA model, specifying either the amount of variance we need to retain or the number of principal components we need.
If we are going to define the amount of variance we need to retain,
# It will retain 0.9 (90%) of the variance of the data points
pca = PCA(0.9)
Or if we are going to define the number of principal components we need,
# It will select 2 principal components only
pca = PCA(n_components=2)
Now let's fit the model to our data points. When passing your data matrix to the pca.fit() method, you have to make sure that the features are in columns and the samples are in rows.
pca.fit(X)
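After fitting, we can inspect the model, for example to see how many components were kept and how much variance each one explains. Here is a minimal sketch using scikit-learn's standard PCA attributes:
# Number of principal components actually selected by the model
print(pca.n_components_)
# Fraction of the total variance explained by each selected component
print(pca.explained_variance_ratio_)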
Next, we have to transform the data points.
Z = pca.transform(X)
We can also do the fitting and the transforming in a single line of code.
Z = pca.fit_transform(X)
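If you want to check that the two approaches agree, here is a quick sketch (assuming the same data and the default PCA settings); the results should match up to floating-point precision:
import numpy as np
# fit() returns the fitted model, so we can chain transform() after it
print(np.allclose(pca.fit(X).transform(X), pca.fit_transform(X)))  # expected: True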
Let's see our code together.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[2, 1, 1], [3, 4, 1], [5, 3, 4], [2, 4, 2]])
pca = PCA(0.9)
pca.fit(X)
Z = pca.transform(X)
The transformed data Z is a 4×2 matrix of new data points, where each row represents a sample and each column represents a (transformed) feature.
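To see the transformed points yourself, print Z and its shape (the exact values can differ in sign depending on the scikit-learn version and solver):
print(Z.shape)  # (4, 2)
print(Z)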
Now let's move on to applying PCA to larger datasets. For that, we are going to use the Iris dataset, which contains 150 samples from 3 classes (Setosa, Versicolor and Virginica), where each sample has 4 features (sepal length, sepal width, petal length and petal width).
To import the Iris dataset,
from sklearn import datasets
IRIS = datasets.load_iris()
Here IRIS is a scikit-learn Bunch object (a dictionary-like container). You can explore the Iris dataset using the following commands.
# To get the iris data
# Each row of the data represents a sample and each column of the data represents a feature
IRIS.data
# To get the names of 4 features
IRIS.feature_names
# To get the target value of each sample
IRIS.target
# To get the names of 3 classes
IRIS.target_names
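As a quick sanity check of the sizes involved, you can print the shapes of the data and the targets:
print(IRIS.data.shape)    # (150, 4)
print(IRIS.target.shape)  # (150,)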
Now let's store the data values and the corresponding target values separately.
X = IRIS.data
Y = IRIS.target
Whenever we apply an algorithm to a dataset, we have to split the dataset into train and test data in order to evaluate the performance of that algorithm.
from sklearn.model_selection import train_test_split
# Splitting the data into 70% train data and 30% test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
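Since the test set takes 30% of the 150 samples, the split sizes should come out as 105 and 45. You can also pass random_state to train_test_split to make the split reproducible. A quick check:
print(X_train.shape)  # (105, 4)
print(X_test.shape)   # (45, 4)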
When applying PCA, you have to make sure that the pca.fit() method, which learns the model parameters from the train data, is applied to the train data only, because the test data should remain unseen by the model. After calling pca.fit() on the train data, we can apply the pca.transform() method to both the train data and the test data.
from sklearn.decomposition import PCA
# To retain 95% of the variance
pca = PCA(0.95)
pca.fit(X_train)
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)
Let's see our code together.
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

IRIS = datasets.load_iris()
X = IRIS.data
Y = IRIS.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
pca = PCA(0.95)
pca.fit(X_train)
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)
If you print the shapes of Z_train and Z_test, you should see the following (with the 70/30 split above, PCA(0.95) keeps 2 components on this data):
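print(Z_train.shape)  # (105, 2)
print(Z_test.shape)   # (45, 2)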
From that, you can see that the 4 original features of the samples have been reduced to 2. After that, you can classify the transformed data using a classifier.
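For example, here is a minimal sketch using scikit-learn's KNeighborsClassifier (any classifier would do; the choice here is just for illustration):
from sklearn.neighbors import KNeighborsClassifier

# Train the classifier on the PCA-transformed train data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Z_train, Y_train)

# Evaluate the accuracy on the PCA-transformed test data
print(knn.score(Z_test, Y_test))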