Python Simple Linear Regression
Source: Coursera Machine Learning with Python
Uses the following libraries:
- NumPy
- Matplotlib
- Pandas
- Scikit-learn
See more about skikit linear regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
Import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
Load the data
url= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv"
Explore the data (df.describe())
Select some features (cdf = df'ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS')
Sample the data (cdf.sample(9))
Visualise the data:
viz = cdf'CYLINDERS','ENGINESIZE','FUELCONSUMPTION_COMB','CO2EMISSIONS'
viz.hist()
plt.show()
Scatter plots
plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("Emission")
plt.show()
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.xlim(0,27)
plt.show()
Extract the input features
X = cdf.ENGINESIZE.to_numpy()
y = cdf.CO2EMISSIONS.to_numpy()
Create train and test datasets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
type(X_train), np.shape(X_train), np.shape(X_train)
Build a simple linear regression model
from sklearn import linear_model
# create a model object
regressor = linear_model.LinearRegression()
# train the model on the training data
# X_train is a 1-D array but sklearn models expect a 2D array as input for the training data, with shape (n_observations, n_features).
# So we need to reshape it. We can let it infer the number of observations using '-1'.
regressor.fit(X_train.reshape(-1, 1), y_train)
# Print the coefficients
print ('Coefficients: ', regressor.coef_[0]) # with simple linear regression there is only one coefficient, here we extract it from the 1 by 1 array.
print ('Intercept: ',regressor.intercept_)
Visualise model outputs
plt.scatter(X_train, y_train, color='blue')
plt.plot(X_train, regressor.coef_ * X_train + regressor.intercept_, '-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")
Evaluate the model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Use the predict method to make test predictions
y_pred = regressor.predict(X_test.reshape(-1,1))
# Evaluation
print("Mean absolute error: %.2f" % mean_absolute_error(y_test, y_pred))
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print("Root mean squared error: %.2f" % np.sqrt(mean_squared_error(y_test, y_pred)))
print("R2-score: %.2f" % r2_score(y_test, y_pred))