Machine Learning
Machine Learning is a subset of AI that uses mathematics to find patterns in large datasets, enabling computers to make decisions without being explicitly programmed. Although not new, Machine Learning has attracted growing interest in recent years due to the availability of big data and the growth of cloud computing.
There are three key approaches for Machine Learning: Supervised Learning, Unsupervised Learning and Reinforcement Learning.
- Supervised learning
- Regression
- Classification
- Unsupervised learning
- Clustering
- Reinforcement learning
- Learning by doing
The machine learning lifecycle is:
- Problem formation and understanding
- Data collection and preparation
- Model training and testing
- Model deployment and maintenance
The Oxford course sets it out as:
- Manage data
- Train model
- Evaluate model
- Deploy model
Coursera sets it out as:
- Problem Definition
- Data Collection
- Data Preparation
- Model Development and Evaluation
- Model Deployment
Problem formation
- Identify suitable problems (not everything is suitable for machine learning. Simple use cases with small datasets, where rules can easily be coded, are not good candidates)
- Define inputs and outputs (Clearly defining the questions to be answered, the inputs and the expected outputs will inform algorithm selection)
- Evaluate data (Ensure the data is suitable, and investigate the relationships between features and the target)
- Integrate outputs (Plan how the model's predictions will be used to provide value)
- Set success metrics (define metrics to measure the success of the outputs)
Training models
There are sources of pre-trained models, such as Model Zoo, AWS Marketplace, and Hugging Face. These models are already trained on large datasets, so they can be reused for similar problems, saving time. Transfer learning can be used to build on these pre-trained models with additional data.
If you can't find a pre-trained model that fits, then training a custom model is necessary. Training complex models will likely require GPUs, which can be accessed via cloud providers like AWS. Amongst other languages, Python can be used to train models, using key libraries like Pandas for data structures, NumPy for numerical computing, Matplotlib and Seaborn for data visualisation and scikit-learn for machine learning algorithms.
Understanding the data
Use data visualisation tools such as Matplotlib and Seaborn. Histograms show the distribution of numerical data, allowing outliers to be identified. Heatmaps show correlations between features, helping to identify redundant features that carry almost the same information. Removing one of each pair of highly correlated features (a form of dimensionality reduction) can also improve the runtime and effectiveness of the model.
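The check that a correlation heatmap supports visually can also be sketched in plain Python. This is a minimal illustration, not a prescribed method: the feature names, the toy values and the 0.9 threshold are all assumptions made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical toy data: area_sqft is area_sqm expressed in different units,
# so the two columns carry almost the same information.
features = {
    "area_sqm": [50.0, 80.0, 120.0, 65.0, 95.0],
    "area_sqft": [538.0, 861.0, 1292.0, 700.0, 1023.0],
    "age_years": [30.0, 5.0, 12.0, 45.0, 8.0],
}

redundant = highly_correlated_pairs(features, threshold=0.9)
for a, b, r in redundant:
    print(f"{a} and {b} are highly correlated (r = {r:.3f})")
```

Only the two area columns are flagged here, so one of them could be dropped. The threshold is a judgement call and depends on the dataset.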
Feature engineering
The data can be manipulated by adding, removing and combining features to improve model prediction capabilities. Missing data can be handled by deleting rows with missing values, predicting missing values, replacing missing values with the mean, or using algorithms such as KNN (K-nearest neighbours) to estimate them from similar rows.
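Two of the simpler options for missing data, deleting incomplete rows and replacing missing values with the mean, can be sketched as follows. This is a minimal illustration that assumes missing values are represented as `None`; the function names and sample data are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def drop_incomplete_rows(rows):
    """Keep only the rows (dicts) that have no missing (None) values."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical sample data with gaps.
ages = [25, None, 35, 40, None]
imputed = impute_mean(ages)  # gaps filled with the mean of 25, 35 and 40

rows = [
    {"age": 25, "income": 30000},
    {"age": None, "income": 42000},
    {"age": 35, "income": None},
]
complete = drop_incomplete_rows(rows)  # only the first row survives
```

Deleting rows is the easiest option but throws data away, which matters on small datasets. Mean imputation keeps every row at the cost of flattening the variation in that feature.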
New features can be created by combining existing features into ratios, such as rooms per household and population per household. The original features can then be removed.
Categorical data can be converted to numeric using one-hot encoding, which creates a binary column for each possible value.
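A minimal sketch of one-hot encoding in plain Python is shown here to make the idea concrete. The `one_hot_encode` helper and the colour values are made up for illustration; in practice a library routine such as pandas `get_dummies` or scikit-learn's `OneHotEncoder` would normally be used instead.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row has a 1 in the column of its category."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Hypothetical categorical feature.
colours = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colours)

print(categories)
for value, row in zip(colours, encoded):
    print(value, row)
```

Each original value becomes a row with exactly one 1, so the number of new columns equals the number of distinct categories.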
Performance evaluation
- Accuracy: Measures the fraction of correct predictions out of the total predictions. It's not ideal for imbalanced datasets.
- Precision: Quantifies the number of true positive predictions out of all positive predictions made. Useful when you want fewer false positives.
- Recall: Measures the number of true positive predictions out of all actual positives. Important when you want fewer false negatives.
- F1 Score: Combines precision and recall into a single metric (their harmonic mean). Effective for imbalanced data.
- AUC (Area Under the ROC Curve): Summarises how well predictions are ranked, as the area under the curve of true positive rate plotted against false positive rate across all classification thresholds. Optimise for AUC if false positives are a concern.
A Confusion Matrix summarises the prediction results of a classification model into true positives, false positives, false negatives and true negatives, thereby helping to understand the accuracy.
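The relationship between the confusion matrix and the metrics above can be sketched in a few lines of Python. This is an illustrative sketch for a binary problem, assuming labels of 1 (positive) and 0 (negative); the function names and the sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive and actual != positive:
            fp += 1
        elif predicted != positive and actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Derive accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 4 actual positives out of 10 examples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(y_true, y_pred)
print(counts)
print(metrics)
```

In this toy example accuracy is noticeably higher than recall, because the many correctly predicted negatives hide the two missed positives. That is the sense in which accuracy can mislead on imbalanced data.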
Regression metrics include:
- R Squared: Measures the proportion of the variance in the actual values that is explained by the predictions, with values closer to 1 indicating a better fit.
- Mean Squared Error (MSE): The average of the squared differences between predicted and actual values. Squaring penalises large errors more heavily.
- Root Mean Squared Error (RMSE): The square root of MSE, which puts the error back in the same units as the target and makes it easier to interpret.
- Mean Absolute Error (MAE): The sum of the absolute differences between actual and predicted values, divided by the total number of predictions.
- Model Performance: Lower MAE and RMSE values indicate better model performance, while higher values suggest the need for improvement.
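The regression metrics can be computed directly from their standard definitions. The sketch below uses only the standard library; the `regression_metrics` name and the sample values are assumptions for illustration.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R squared for paired actual/predicted values."""
    n = len(y_true)
    errors = [actual - predicted for actual, predicted in zip(y_true, y_pred)]

    mse = sum(e ** 2 for e in errors) / n    # mean of squared errors
    rmse = sqrt(mse)                         # same units as the target
    mae = sum(abs(e) for e in errors) / n    # mean of absolute errors

    # R squared: 1 minus (residual sum of squares / total sum of squares).
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((actual - mean_true) ** 2 for actual in y_true)
    r2 = 1 - ss_res / ss_tot

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}


# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

results = regression_metrics(y_true, y_pred)
print(results)
```

RMSE comes out larger than MAE here because the single error of 2 is weighted more heavily once squared.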
Examples of algorithms
- Supervised learning
- Linear regression
- LASSO and ridge regression
- Logistic regression
- Decision tree
- Random decision forests
- Support vector machines (SVM)
- Naive Bayes classifier
- Unsupervised learning
- k-means
- Reinforcement Learning
- Q-learning
- Policy gradient
- Deep Learning
- Convolutional neural networks (CNN)
- Recurrent neural networks (RNN)
- Encoders and transformers
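To make one entry from the list above concrete, here is a minimal sketch of k-means clustering on 2-D points. It is a simplified illustration rather than a production implementation: the initial centroids are simply the first k points (an assumption made so the result is deterministic), the iteration count is fixed, and the sample points are made up.

```python
def kmeans(points, k, iterations=10):
    """Cluster 2-D points by alternating assignment and centroid-update steps."""
    # Assumption: use the first k points as the starting centroids.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance is enough for comparison).
        assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, assignments


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

No labels are supplied, which is what makes this unsupervised: the grouping emerges only from the distances between points. Real implementations, such as the one in scikit-learn, use smarter initialisation and stop when the centroids no longer move.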
Sources:
- https://www.linkedin.com/learning-login/share?forceAccount=false&redirect=https%3A%2F%2Fwww.linkedin.com%2Flearning%2Fartificial-intelligence-foundations-machine-learning-22345868%3Ftrk%3Dshare_ent_url%26shareId%3DijJKBy%252FVRb%252Bt7Y0J2U0Q9w%253D%253D
- Said Business School Oxford AI Course, Module 2