Unsupervised Machine Learning

Unsupervised machine learning is a class of techniques used to identify patterns, structures and groupings (clusters) within datasets. Crucially, unlike supervised machine learning, it does not require prior labelling.

There are two primary pillars of unsupervised machine learning: Clustering and Dimension Reduction.

[Infographic: overview of unsupervised learning]

Clustering Overview

This part covers the main types of clustering methods and the applications of clustering.

The main purpose of clustering in machine learning is to group similar data points together based on their characteristics or features. This helps in identifying patterns and structures within the data without prior labeling.

Clustering is essential for understanding the underlying structure of data and making informed decisions based on those insights.

The differences between k-means clustering and hierarchical clustering are as follows:

K-Means Clustering

    • Requires the number of clusters (K) to be chosen in advance and assigns each point to its nearest cluster centroid.
    • Computationally efficient and scales well to large datasets.

Hierarchical Clustering

    • Builds a hierarchy (dendrogram) of nested clusters by successively merging or splitting groups, with no need to fix the number of clusters up front.
    • More flexible for exploring relationships in the data, but typically more expensive on large datasets.

In summary, k-means is efficient for large datasets with a fixed number of clusters, while hierarchical clustering provides a more flexible approach to exploring data relationships without prior specification of the number of clusters.
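As a minimal sketch of the practical difference (assuming scikit-learn and synthetic blob data, both of which are illustrative choices), the two methods can be run side by side:

    # Minimal sketch comparing k-means and hierarchical (agglomerative) clustering.
    # Assumes scikit-learn is installed; the blob data is synthetic and illustrative only.
    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans, AgglomerativeClustering

    # A simple dataset with three well-separated groups.
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # K-means needs the number of clusters up front.
    kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

    # Agglomerative clustering builds a hierarchy; it is cut at 3 clusters here
    # for comparison, but the full dendrogram could be explored instead.
    hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

    print(kmeans_labels[:10], hier_labels[:10])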

If you use k-means clustering on non-spherical clusters, several issues may arise: elongated or irregular groups can be split apart or merged with their neighbours, and individual points can be assigned to the wrong cluster, because the algorithm assumes roughly spherical, evenly sized groups around each centroid.

Overall, k-means is not well-suited to non-spherical clusters, and alternative clustering methods, such as density-based clustering (e.g., DBSCAN) or hierarchical clustering, may be more effective in such scenarios.

Non-spherical clusters

Non-spherical clusters are groups of data points that do not form a round or spherical shape when visualised. Instead, they can take on various forms, such as elongated, irregular, or even complex shapes. This is particularly important in clustering because many traditional clustering algorithms, like k-means, assume that clusters are spherical and evenly sized, which can lead to inaccurate results when dealing with real-world data.

To illustrate this concept, imagine a group of friends standing in a park. If they are all standing in a circle, that represents a spherical cluster. However, if they are scattered in a line or in a more complex formation, like a star shape, that represents a non-spherical cluster. Density-based clustering algorithms, such as DBSCAN, are often used to identify these non-spherical clusters because they can adapt to the shape of the data, allowing for more accurate grouping of points based on their density rather than their distance from a central point.
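A minimal sketch of this failure mode (assuming scikit-learn; the eps and min_samples values are illustrative, not tuned) uses two interleaving half-moons, which are clearly two groups but far from spherical:

    # Illustrative sketch: k-means vs DBSCAN on non-spherical (crescent-shaped) clusters.
    # Assumes scikit-learn; parameter values are illustrative.
    from sklearn.datasets import make_moons
    from sklearn.cluster import KMeans, DBSCAN

    # Two interleaving half-moons: two obvious groups, neither of them spherical.
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

    # K-means cuts the data with a straight boundary and mixes the two shapes.
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # DBSCAN grows clusters by density, so it can follow each crescent's shape.
    db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)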

K-Means

This section focuses on K-Means clustering, an iterative, centroid-based algorithm used to partition a dataset into groups of similar points. It covers how the algorithm works, how to use it in practice, its challenges and limitations, and how to determine the optimal number of clusters (K).

The main objective of the K-Means algorithm is to minimize the within-cluster variance across all clusters. This means the algorithm aims to make data points within each cluster as similar as possible (i.e., close to their centroid); greater separation between different clusters follows indirectly from this, rather than being optimized directly.

In mathematical terms, this involves minimizing the sum of the squared distances between each data point and its corresponding cluster centroid.
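In symbols, the algorithm minimizes J = sum over clusters k of sum over points x in cluster C_k of ||x − μ_k||², where μ_k is the centroid of C_k. The sketch below (assuming scikit-learn and synthetic blob data) computes this quantity by hand and checks it against the inertia_ value scikit-learn reports:

    # Sketch of the k-means objective: the within-cluster sum of squared distances
    # (inertia), computed by hand and compared with scikit-learn's inertia_.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=200, centers=3, random_state=1)
    km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

    # Sum of squared distances from each point to the centroid of its assigned cluster.
    wcss = sum(
        np.sum((X[km.labels_ == k] - centroid) ** 2)
        for k, centroid in enumerate(km.cluster_centers_)
    )
    print(wcss, km.inertia_)  # the two values should agree up to floating-point error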

In K-Means clustering, centroids play a crucial role as they represent the center of each cluster: every data point is assigned to its nearest centroid, and after each assignment pass the centroids are recomputed as the mean of the points in their cluster, repeating until the assignments stop changing.

Overall, centroids are essential for defining the structure of the clusters and guiding the iterative process of the K-Means algorithm.

The distance matrix in K-Means clustering holds the distance from every data point to every centroid. At each iteration it is used to assign each point to its nearest centroid, and recomputing it after the centroids move is what drives the algorithm towards convergence.

Overall, the distance matrix is fundamental to the K-Means algorithm, enabling it to partition the dataset into meaningful clusters based on spatial relationships.
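As a rough sketch of how that works (assuming NumPy and SciPy; the points and centroids here are random and purely illustrative), one assignment-and-update step can be written directly from the distance matrix:

    # Sketch of one k-means assignment-and-update step driven by a distance matrix.
    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))          # 50 points in 2-D (illustrative data)
    centroids = rng.normal(size=(3, 2))   # 3 current centroids

    # Distance matrix: rows are points, columns are centroids.
    D = cdist(X, centroids)               # shape (50, 3)

    # Assignment step: each point goes to its nearest centroid.
    labels = D.argmin(axis=1)

    # Update step: each centroid moves to the mean of its assigned points
    # (a centroid with no points is simply left where it is in this sketch).
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(3)
    ])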

Determining the optimal value of K in K-Means clustering can be challenging, but several techniques can help:

  1. Elbow Method:
    • Plot the sum of squared distances (inertia) for different values of K.
    • Look for the "elbow" point in the graph, where the rate of decrease sharply changes. This point suggests a suitable K value.
  2. Silhouette Analysis:
    • Calculate the silhouette score for different K values, which measures how similar a data point is to its own cluster compared to other clusters.
    • A higher silhouette score indicates better-defined clusters, helping to identify the optimal K.
  3. Davies-Bouldin Index:
    • This index evaluates the average similarity ratio of each cluster with its most similar cluster.
    • A lower Davies-Bouldin index indicates better clustering, guiding the selection of K.

Using these methods can provide insights into the most appropriate number of clusters for your dataset.
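A minimal sketch of all three techniques (assuming scikit-learn; the data and the range of K values are illustrative):

    # Sketch: comparing candidate K values with inertia, silhouette score,
    # and the Davies-Bouldin index. Assumes scikit-learn.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score, davies_bouldin_score

    X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

    for k in range(2, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
        inertia = km.inertia_                      # elbow method: plot this against K
        sil = silhouette_score(X, km.labels_)      # higher is better
        dbi = davies_bouldin_score(X, km.labels_)  # lower is better
        print(f"K={k}: inertia={inertia:.1f}, silhouette={sil:.3f}, DB index={dbi:.3f}")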

If K is set too high in K-Means clustering, several issues may arise: natural groups get split into small, fragmented clusters, the model starts fitting noise rather than structure, and the results become harder to interpret.

If K is set too low in K-Means clustering, several problems may occur: genuinely distinct groups are merged together, the within-cluster variance remains high, and the underlying structure of the data is obscured.

DBSCAN and HDBSCAN Clustering

This section covers DBSCAN and HDBSCAN, two algorithms for density-based spatial clustering, giving an overview of each before comparing them and their applications.

The main differences between DBSCAN and HDBSCAN are that DBSCAN relies on a single, global density threshold (the epsilon radius), so it struggles when clusters have very different densities, whereas HDBSCAN builds a hierarchy of clusterings across density levels, removes the epsilon parameter, and keeps only the most stable clusters.

These differences make HDBSCAN more effective in complex datasets with varying densities and noise.

In DBSCAN, a core point is defined as a point that has at least a specified minimum number of neighboring points (including itself) within a given radius (epsilon). In other words, with parameters min_samples and epsilon, a point is a core point when at least min_samples points lie within distance epsilon of it.

Core points are essential for forming clusters, as they serve as the focal points from which clusters are expanded by including their neighboring points.
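A short sketch showing which points scikit-learn's DBSCAN marks as core points (the eps and min_samples values are illustrative, not tuned):

    # Sketch: identifying DBSCAN core points via core_sample_indices_.
    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
    db = DBSCAN(eps=0.2, min_samples=5).fit(X)

    # Each core point has at least min_samples neighbours within eps of it.
    core_indices = db.core_sample_indices_
    print(len(core_indices), "core points out of", len(X))

    # Labels give the cluster ID for core/border points; -1 marks noise.
    print(set(db.labels_))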

In DBSCAN, border points play a crucial role in the clustering process: a border point lies within epsilon of a core point but does not itself have enough neighbours to be a core point, so it is attached to the cluster of a nearby core point without being able to extend that cluster any further.

Overall, border points help define the shape and extent of clusters while indicating areas where the density of points decreases.

HDBSCAN, which stands for Hierarchical Density-Based Spatial Clustering of Applications with Noise, is an advanced clustering algorithm that builds upon the principles of DBSCAN. Unlike its predecessor, HDBSCAN does not require you to fix a global distance threshold (DBSCAN's epsilon), making it more user-friendly and adaptable to various datasets. Imagine trying to find groups of friends in a crowded room without knowing how many groups there are; HDBSCAN helps you do just that by automatically adjusting to the density of the crowd.

Here's how it works: HDBSCAN starts by treating each data point as its own cluster, similar to how you might initially see every person in the room as an individual. It then gradually merges these clusters based on their density, creating a hierarchy of clusters. This process allows HDBSCAN to identify clusters of varying shapes and sizes, even in the presence of noise or outliers. The result is a more coherent and meaningful representation of your data, where clusters are defined not just by their proximity but also by their stability across different density levels.
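A minimal sketch, assuming scikit-learn 1.3+ (which provides sklearn.cluster.HDBSCAN; the standalone hdbscan package exposes a very similar interface), on synthetic data mixing a dense blob, a sparse blob, and uniform noise:

    # Sketch of HDBSCAN on clusters of very different density plus noise.
    # Assumes scikit-learn 1.3+; all data and parameter values are illustrative.
    import numpy as np
    from sklearn.cluster import HDBSCAN
    from sklearn.datasets import make_blobs

    rng = np.random.default_rng(0)
    dense, _ = make_blobs(n_samples=200, centers=[[0, 0]], cluster_std=0.3, random_state=0)
    sparse, _ = make_blobs(n_samples=200, centers=[[5, 5]], cluster_std=1.5, random_state=0)
    noise = rng.uniform(-4, 9, size=(30, 2))
    X = np.vstack([dense, sparse, noise])

    # No epsilon to tune: min_cluster_size is the main knob.
    labels = HDBSCAN(min_cluster_size=15).fit_predict(X)
    print(set(labels))  # cluster IDs, with -1 for points treated as noise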

Clustering, Dimension Reduction, and Feature Engineering

This section brings together clustering techniques and dimension reduction, with face recognition as an example application.

Principal Component Analysis (PCA) plays a crucial role in dimension reduction by transforming high-dimensional data into a lower-dimensional space while preserving as much variance as possible. It does this by finding orthogonal directions (the principal components) along which the data varies most and projecting the data onto the top few of them, which removes redundancy and noise and speeds up downstream models.

In summary, PCA is a powerful technique for reducing the complexity of data while retaining its important information, making it a valuable preprocessing step in machine learning tasks.
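A minimal sketch of PCA as a preprocessing step (assuming scikit-learn and its built-in digits dataset; the 95% variance target is an illustrative choice):

    # Sketch: reduce 64-dimensional digit images while keeping ~95% of the variance.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_digits(return_X_y=True)           # 64 features per sample
    X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

    # Keep enough components to explain roughly 95% of the variance.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_scaled)

    print(X.shape, "->", X_reduced.shape)
    print("variance explained:", pca.explained_variance_ratio_.sum())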

Clustering plays a significant role in feature selection by helping to identify and group similar or correlated features: features that cluster together carry largely redundant information, so keeping one representative per group shrinks the feature space without losing much signal.

In summary, clustering aids in feature selection by simplifying the feature space, enhancing model performance, and providing insights into feature relationships.

Clustering can significantly enhance feature engineering decisions, for example by adding each point's cluster label (or its distance to each cluster centroid) as a new feature, or by revealing segments of the data that benefit from separate transformations, as sketched below.

In summary, clustering aids in making informed feature engineering decisions by revealing patterns, guiding transformations, and simplifying the feature space, ultimately leading to better model performance and interpretability.
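A minimal sketch of that idea (assuming scikit-learn and synthetic blob data; the choice of four clusters is illustrative), where cluster labels and centroid distances are appended as extra columns:

    # Sketch: using cluster membership as engineered features.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=3)

    km = KMeans(n_clusters=4, n_init=10, random_state=3).fit(X)
    cluster_label = km.labels_.reshape(-1, 1)   # categorical-style feature
    centroid_dists = km.transform(X)            # distance to every centroid

    X_engineered = np.hstack([X, cluster_label, centroid_dists])
    print(X.shape, "->", X_engineered.shape)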

If you don't reduce dimensions before clustering, several challenges may arise: distance measures become less meaningful as the number of features grows (the curse of dimensionality), the computation becomes slower, and the resulting clusters are harder to visualise and interpret.

In summary, not reducing dimensions before clustering can lead to ineffective clustering results, increased computational demands, and difficulties in interpretation and visualization.
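A minimal sketch of reducing dimensions before clustering (assuming scikit-learn; the digits dataset and the choice of 10 components and 10 clusters are illustrative):

    # Sketch: scale, project onto a few principal components, then cluster.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    from sklearn.pipeline import make_pipeline

    X, _ = load_digits(return_X_y=True)   # 64 features per sample

    pipeline = make_pipeline(
        StandardScaler(),
        PCA(n_components=10),
        KMeans(n_clusters=10, n_init=10, random_state=0),
    )
    labels = pipeline.fit_predict(X)
    print(len(set(labels)), "clusters found")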

Dimension Reduction Algorithms

Dimension Reduction Algorithms are essential for simplifying high-dimensional datasets while preserving critical information.

The main algorithms covered are Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and UMAP.

If PCA fails or is not suitable for your data, you can consider the following alternative dimensionality reduction methods:

  1. T-Distributed Stochastic Neighbor Embedding (t-SNE):
    • Focuses on preserving local similarities in high-dimensional data.
    • Effective for visualizing complex datasets, especially in two or three dimensions.
    • However, it may struggle with scalability and requires careful tuning of hyperparameters.
  2. Uniform Manifold Approximation and Projection (UMAP):
    • Constructs a high-dimensional graph representation of the data and optimizes a low-dimensional structure.
    • Preserves both local and global data structures, often providing better clustering performance than t-SNE.
    • Generally scales better than t-SNE and is suitable for larger datasets.

These methods can be particularly useful when dealing with non-linear relationships in the data that PCA may not capture effectively.
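A minimal sketch of both methods (assuming scikit-learn for t-SNE and the umap-learn package, imported as umap, for UMAP; all hyperparameter values are illustrative):

    # Sketch: 2-D embeddings of the digits dataset with t-SNE and UMAP.
    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE
    import umap

    X, y = load_digits(return_X_y=True)

    # t-SNE: good for 2-D visualisation, but sensitive to perplexity and slower on large data.
    X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

    # UMAP: tends to preserve more global structure and generally scales better.
    X_umap = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)

    print(X_tsne.shape, X_umap.shape)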

Python K-means Clustering
Python Comparing DBSCAN and HDBSCAN
Python Principal Component Analysis (PCA)
Python t-SNE and UMAP