Principal Components Analysis
Given a collection of points in a multidimensional space, a "best fitting" line can be defined as one that minimizes the average squared distance from a point to the line. The next best-fitting line can be similarly chosen from directions perpendicular to the first. Repeating this process yields an orthogonal basis in which different individual dimensions of the data are uncorrelated. These basis vectors are called principal components, and several related procedures built on them are collectively known as Principal Component Analysis (PCA).
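As a concrete illustration of this definition, the sketch below computes principal components directly as the eigenvectors of the covariance matrix of a small synthetic dataset; the data and variable names are purely illustrative.

import numpy as np

# Toy data: 200 correlated 2-D points (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

# Center the data, then eigendecompose its covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort components by decreasing variance (eigenvalue).
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Columns of `eigenvectors` are the principal components; projecting
# onto them gives coordinates that are uncorrelated with one another.
scores = X_centered @ eigenvectors
print("Variance along each component:", eigenvalues)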
Use in Machine Learning and AI
Principal Component Analysis (PCA) is a widely used technique in machine learning and AI, primarily for dimensionality reduction, feature extraction, and data preprocessing. Its ability to simplify complex datasets while preserving most of the important information makes it valuable across a wide range of applications in the field.
Dimensionality Reduction
One of the primary uses of PCA in machine learning is to reduce the number of features in high-dimensional datasets:
Curse of Dimensionality: PCA helps mitigate the "curse of dimensionality" by reducing the number of features while preserving most of the important information.
Computational Efficiency: By reducing the number of dimensions, PCA can significantly speed up machine learning algorithms, especially when dealing with large datasets.
Noise Reduction: PCA can help eliminate noise in the data by focusing on the principal components that capture the most variance.
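As a rough sketch of this idea, scikit-learn's PCA can be asked to keep just enough components to explain a chosen fraction of the total variance; the 0.95 threshold below is an arbitrary illustrative choice, and the bundled digits dataset stands in for a real high-dimensional dataset.

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = datasets.load_digits(return_X_y=True)   # 64 features per sample
X_scaled = StandardScaler().fit_transform(X)

# Passing a float in (0, 1) keeps only enough components
# to explain that fraction of the total variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("Original dimensions:", X.shape[1])
print("Reduced dimensions: ", X_reduced.shape[1])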
Feature Extraction and Selection
PCA is valuable for extracting meaningful features from complex datasets:
Uncovering Latent Patterns: PCA can reveal hidden patterns in the data that might not be apparent in the original feature space.
Feature Importance: By analyzing the eigenvalues associated with each principal component, data scientists can identify which features contribute most to the variance in the data.
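For example, the explained-variance ratios (which are proportional to the eigenvalues) and the component loadings can be inspected directly. A minimal sketch, again assuming the digits dataset and an arbitrary choice of ten components:

import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = datasets.load_digits(return_X_y=True)
pca = PCA(n_components=10).fit(StandardScaler().fit_transform(X))

# Eigenvalue-based importance of each principal component.
print("Explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))

# Each row of components_ holds the loading of every original feature
# on that component; large absolute loadings flag influential features.
top_features = np.argsort(np.abs(pca.components_[0]))[::-1][:5]
print("Features contributing most to the first component:", top_features)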
Data Visualization
PCA is often used to visualize high-dimensional data:
2D and 3D Plotting: By reducing data to two or three principal components, complex datasets can be visualized in 2D or 3D plots, making it easier to identify clusters and patterns.
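A minimal 2-D visualization sketch along these lines, using matplotlib and the digits dataset (the colormap and marker size are arbitrary choices):

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_digits(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Each point is a 64-dimensional digit image projected onto two components.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=15, cmap='tab10')
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.colorbar(label="Digit class")
plt.show()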
Preprocessing for Machine Learning Models
PCA is frequently used as a preprocessing step for other machine learning algorithms:
Improving Model Performance: By reducing dimensionality, PCA can help prevent overfitting and improve the generalization of machine learning models.
Multicollinearity Reduction: PCA can help address multicollinearity issues in regression models by creating orthogonal principal components.
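One common pattern is to chain scaling, PCA, and a downstream model into a single pipeline; the sketch below uses scikit-learn's make_pipeline with logistic regression, where the number of components is an illustrative choice rather than a tuned value.

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_digits(return_X_y=True)

# The PCA outputs are orthogonal, so the downstream regression
# does not face multicollinearity among its inputs.
model = make_pipeline(StandardScaler(),
                      PCA(n_components=20),
                      LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5)
print("Mean cross-validated accuracy: {:.3f}".format(scores.mean()))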
Specific Applications
PCA finds applications in various AI and machine learning tasks:
Image Processing: In computer vision, PCA is used for tasks like facial recognition and image compression.
Natural Language Processing: PCA can be applied to reduce the dimensionality of word embeddings or document-term matrices.
Anomaly Detection: By identifying the principal components that explain most of the variance, PCA can help detect anomalies in data that deviate from these main patterns.
Time Series Analysis: PCA can be used to extract trends and patterns from multivariate time series data.
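As one illustration of the anomaly-detection idea, the reconstruction error after projecting onto a few components can serve as an anomaly score; the sketch below is a hedged example with NumPy and scikit-learn, and the percentile cutoff is purely illustrative.

import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = datasets.load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Keep a small number of components and reconstruct the data from them.
pca = PCA(n_components=10).fit(X_scaled)
X_reconstructed = pca.inverse_transform(pca.transform(X_scaled))

# Samples that the principal components reconstruct poorly are flagged.
errors = np.mean((X_scaled - X_reconstructed) ** 2, axis=1)
threshold = np.percentile(errors, 99)        # illustrative cutoff
print("Potential anomalies:", np.where(errors > threshold)[0])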
Considerations and Limitations
While PCA is powerful, it's important to be aware of its limitations:
Interpretability: The principal components, being linear combinations of original features, can be difficult to interpret in terms of the original variables.
Linearity Assumption: PCA assumes linear relationships between variables, which may not always hold in complex, real-world datasets.
Information Loss: While PCA aims to preserve most of the variance, some information is inevitably lost in the dimensionality reduction process.
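When the linearity assumption is too restrictive, non-linear variants can be substituted; a brief sketch using scikit-learn's KernelPCA is shown below, where the RBF kernel and the component count are illustrative choices.

from sklearn import datasets
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

X, _ = datasets.load_digits(return_X_y=True)

# Kernel PCA performs the projection in an implicit non-linear feature space.
kpca = KernelPCA(n_components=2, kernel='rbf')
X_kpca = kpca.fit_transform(StandardScaler().fit_transform(X))
print("Embedded shape:", X_kpca.shape)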
Python Example
""" dimensionality_reduction_using_scikit_learn.py """ # Import needed libraries. import numpy as np import matplotlib.pyplot as plotlib from sklearn import datasets from sklearn.model_selection import train_test_split from sklearn.decomposition import PCA from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.neighbors import KNeighborsClassifier from sklearn.pipeline import make_pipeline from sklearn.preprocessing import StandardScaler # Set parameters. random_state = 0 test_data_proportion = 0.25 number_of_neighbors = 5 number_of_components = 2 # Load Digits dataset X, y = datasets.load_digits(return_X_y=True) # Split data into X/y train/test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_data_proportion, stratify=y, random_state=random_state) dim = len(X[0]) n_classes = len(np.unique(y)) # Reduce dimension to 2 with PCA pca = make_pipeline(StandardScaler(), PCA(n_components=number_of_components, random_state=random_state)) # Reduce dimension to 2 with LinearDiscriminantAnalysis lda = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(n_components=number_of_components)) # Instantiate a k nearest neighbors classifier for evaluation. knn = KNeighborsClassifier(n_neighbors=number_of_neighbors) # Make a list of the methods to be compared dim_reduction_methods = [('PCA', pca), ('LDA', lda)] # plt.figure() for i, (name, model) in enumerate(dim_reduction_methods): plotlib.figure() # plt.subplot(1, 3, i + 1, aspect=1) # Fit the method's model model.fit(X_train, y_train) # Fit a nearest neighbor classifier on the embedded training set knn.fit(model.transform(X_train), y_train) # Compute the nearest neighbor accuracy on the embedded test set acc_knn = knn.score(model.transform(X_test), y_test) # Embed the data set in 2 dimensions using the fitted model X_embedded = model.transform(X) # Plot the projected points and show the evaluation score plotlib.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, s=30, cmap='Set1') plotlib.title("{}, KNN (k={})\nTest accuracy = {:.2f}".format(name, number_of_neighbors, acc_knn)) plotlib.show()
Results are shown below: