Machine Learning

Machine Learning is a subset of AI that uses mathematics to find patterns in large datasets, enabling computers to make decisions without being explicitly programmed. Whilst not new, Machine Learning has surged in interest in recent years due to the availability of big data and the growth of cloud computing.

There are three key approaches for Machine Learning: Supervised Learning, Unsupervised Learning and Reinforcement Learning.

The machine learning lifecycle is:

The Oxford course sets it out as:

Coursera sets it out as:

Problem formulation

Training models

There are several sources of pre-trained models, such as Model Zoo, AWS Marketplace, and Hugging Face. These models have already been trained on large datasets, so they can be reused for similar problems, saving time. Transfer Learning can be used to build on a pre-trained model with additional, task-specific data.
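A minimal sketch of Transfer Learning, assuming TensorFlow/Keras (the framework, model, image size and task below are illustrative assumptions, not part of the original notes): a base model pre-trained on ImageNet is frozen and only a small new head is trained on the new data.

```python
import tensorflow as tf

# Load a model pre-trained on ImageNet, without its original classification head
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze the pre-trained weights

# Add a new head for the new task (here: an assumed binary classification problem)
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # train only the new head on your own data
```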

If you can't find a pre-trained model that fits, then training a custom model is necessary. Training complex models will likely require GPUs, which can be accessed via cloud providers like AWS. Amongst other languages, Python can be used to train models, using key libraries such as Pandas for data structures, NumPy for numerical computing, Matplotlib and Seaborn for data visualisation, and scikit-learn for machine learning algorithms.
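A minimal sketch of training a custom model with these libraries, assuming a hypothetical CSV file with a "label" column (the file name, column name and choice of algorithm are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hypothetical dataset; "training_data.csv" and the "label" column are placeholders
df = pd.read_csv("training_data.csv")
X, y = df.drop(columns=["label"]), df["label"]

# Hold back 20% of the data for evaluating the trained model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```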

Understanding the data

Use data visualisation tools such as Matplotlib and Seaborn. Histograms show the distribution of numerical data, allowing outliers to be identified. Heatmaps show correlations between features, helping to identify duplicate or redundant features that can be removed. Removing highly correlated features (dimensionality reduction) can also improve the runtime and effectiveness of the model.
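A minimal sketch of both plots with Matplotlib and Seaborn (the dataset and column name are illustrative assumptions):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("training_data.csv")  # hypothetical dataset

# Histogram: distribution of a single numerical feature, revealing skew and outliers
df["median_income"].hist(bins=50)  # "median_income" is a placeholder column name
plt.xlabel("median_income")
plt.show()

# Heatmap: pairwise correlations between numerical features
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```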

Feature engineering

The data can be manipulated by adding, removing and combining features to improve the model's prediction capabilities. Missing data can be handled by deleting rows with missing values, predicting missing values, replacing missing values with the mean, or using imputation algorithms such as KNN (k-nearest neighbours).
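A minimal sketch of these options using Pandas and scikit-learn (the dataset is a hypothetical numeric DataFrame):

```python
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.read_csv("training_data.csv")  # hypothetical dataset
numeric = df.select_dtypes("number")

# Option 1: delete rows containing missing values
dropped = df.dropna()

# Option 2: replace missing values with the column mean
mean_filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(numeric), columns=numeric.columns
)

# Option 3: estimate missing values from the k nearest neighbours
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=5).fit_transform(numeric), columns=numeric.columns
)
```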

New features can be created by combining correlated features, for example deriving rooms per household and population per household from raw room, population and household counts; the original features can then be removed.
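A minimal sketch with Pandas, assuming housing-style columns such as total_rooms, population and households (the file and column names are illustrative):

```python
import pandas as pd

df = pd.read_csv("housing.csv")  # hypothetical housing dataset

# Combine correlated raw counts into ratio features
df["rooms_per_household"] = df["total_rooms"] / df["households"]
df["population_per_household"] = df["population"] / df["households"]

# The original raw counts can then be dropped in favour of the derived features
df = df.drop(columns=["total_rooms", "population"])
```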

Categorical data can be converted to numeric using one-hot encoding, which creates a binary column for each possible value.
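A minimal sketch using Pandas' get_dummies (the column name and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"]})

# One binary column is created per distinct category value
encoded = pd.get_dummies(df, columns=["ocean_proximity"])
print(encoded)
```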

Performance evaluation

A Confusion Matrix summarises the prediction results of a classification model into true positives, false positives, false negatives and true negatives, thereby helping to assess how accurate the model's predictions are.
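A minimal sketch with scikit-learn, using made-up labels and predictions for illustration:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # illustrative actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # illustrative model predictions

# For binary labels 0/1, the matrix is laid out as:
# [[true negatives, false positives],
#  [false negatives, true positives]]
print(confusion_matrix(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
```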

Regression metrics include:
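For example, mean absolute error (MAE), root mean squared error (RMSE) and R² are all standard regression metrics and can be computed with scikit-learn; a minimal sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 2.5, 4.1, 5.0]  # illustrative actual values
y_pred = [2.8, 2.7, 3.9, 5.3]  # illustrative predicted values

mae = mean_absolute_error(y_true, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalises large errors more heavily
r2 = r2_score(y_true, y_pred)                       # proportion of variance explained
print(f"MAE={mae:.3f}, RMSE={rmse:.3f}, R^2={r2:.3f}")
```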

Examples of algorithms

Sources: