Implementation of Linear Regression using Python packages

The previous Day 3 blog laid the foundation for Linear Regression. Now we can get familiarised with some of the Python libraries used for creating ML models. These packages already embed the logic of the algorithm, and all we need to do is import them in our code. The aim is to predict housing prices from the popular Boston housing price dataset (which ships with scikit-learn).

Before jumping into the code, it’s nice to know a couple of statistical terms with an example.

Population — The complete set of data we are interested in. In practice, it is rarely possible to obtain the whole population, since we cannot cover every possible observation.

Sample — A portion from the population which mirrors the characteristics of the population.

Mean (average) — The sum of all data points in the distribution (population/sample) divided by the total count of the data points.

X = [1, 5, 20, 30, 60]; μ denotes the population mean and x̄ (x bar) denotes the sample mean.

mean = sum of X / count of X = (1 + 5 + 20 + 30 + 60) / 5 = 23.2

Variance — How much each value in the distribution varies from the mean.

Variance formula: sample variance s² = Σ(x - x̄)² / (n - 1); population variance σ² = Σ(x - μ)² / N.

I would recommend checking this video if you are curious to know why we divide the sample variance by n-1 instead of n.

variance = [(1 - 23.2)² + (5 - 23.2)² + (20 - 23.2)² + (30 - 23.2)² + (60 - 23.2)²] / 4 = 558.7 (treating X as a sample, so we divide by n - 1 = 4)

Population variance is represented by the symbol σ².

Standard deviation is the square root of the variance. s — sample standard deviation and σ — population standard deviation.

Quantiles are the points that divide the ordered observations into equal-sized subgroups.

The quartiles are the most common case: the first quartile covers roughly the lowest 25% of the data, the second quartile runs up to 50% (the median), the third quartile up to 75%, and the fourth quartile includes the rest.

Quantile split

The scikit-learn library in Python has a lot of inbuilt packages for different machine learning algorithms, and we will import those for the subsequent processing. There is one unfamiliar term, ‘r2_score’ (discussed soon), among the metrics apart from the well-known mean_squared_error.
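A minimal sketch of the imports this post relies on; the exact lines are not shown in the original, so treat them as an assumption. Note that load_boston has been removed from recent scikit-learn releases.

```python
# Assumed import block (not copied from the original post)
from sklearn.datasets import load_boston            # Boston housing data (removed in scikit-learn >= 1.2)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
```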

In order to measure model performance, the complete dataset is split into train and test sets. The training set is for building (fitting) the model, and the test set will be used for evaluating the predictive power of the model. During the fit process, the model compares the actual output and predicted output of the training data and learns to figure out the hidden relationship between the independent (input) and dependent (target) variables.

Next are the two most frequently used libraries, pandas and NumPy. They help manipulate data and store it in a structured format. One key differentiating factor is that pandas can hold multiple data types (both qualitative and quantitative) in a tabular format called a DataFrame, whereas NumPy is preferred for fast numerical computation and cannot handle heterogeneous data types.
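A small illustrative sketch of that difference; the demo values below are made up purely for demonstration.

```python
import pandas as pd
import numpy as np

# A pandas DataFrame can mix qualitative and quantitative columns
demo_df = pd.DataFrame({"town": ["A", "B"], "price": [24.0, 21.6]})

# A NumPy array holds a single, homogeneous dtype and is built for fast math
prices = np.array([24.0, 21.6])

print(demo_df.dtypes)   # object + float64 columns
print(prices.dtype)     # float64
```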

We will explore the dataset to understand its content. boston_dataset comes in the form of a dictionary, i.e. {key1: value1, key2: value2, …, keyn: valuen}, so the details exist as key/value pairs; the keys are listed below, followed by a short loading sketch. Note: in Python, ‘#’ marks a commented line.

‘data’ => Input predictors

‘target’ => actual output

‘feature_names’ => Names of each input predictor

‘DESCR’ => Complete description of the dataset including column definition

‘filename’ => path of the Boston file
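A sketch of loading the dataset and inspecting these keys, assuming an older scikit-learn where load_boston is still available:

```python
boston_dataset = load_boston()

print(boston_dataset.keys())
# e.g. dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

# print(boston_dataset.DESCR)   # uncomment to read the full column descriptions
```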

Actual Input predictor values

In the shape output, the first value indicates the number of rows, i.e. the count of input samples, and the second value (columns) denotes the total count of input predictors.
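Roughly what that check looks like:

```python
print(boston_dataset.data.shape)   # (506, 13): 506 samples, 13 input predictors
```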

Details of the Target/ dependent output

In total we have 506 target values corresponding to the 506 input samples. The column count is 1 even though it is not displayed; for 1-D arrays, shape simply gives the total count of elements.
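And for the target:

```python
print(boston_dataset.target.shape)   # (506,): a 1-D array, so shape gives only the count
```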

Column names and file name path

Let’s create a DataFrame of input predictors by merging the feature names and feature values.
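One way this merge can be done; the target column name 'MEDV' is an assumption, since the original post does not show it.

```python
boston_df = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston_df['MEDV'] = boston_dataset.target   # append the target as a 14th column
```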

We can do some basic data exploration, starting with displaying the first 5 records from the DataFrame using the head() function.

The tabular format of the DataFrame
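A sketch of that call:

```python
boston_df.head()   # first 5 records of the DataFrame
```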

Since we merged the target into the input, we have 14 columns. The info() command shows the count of non-null values in each column and the datatype of each field. In Python, all index values start at 0, and for that very reason the ‘#’ column in the info() output begins with 0.

Information about the DataFrame
datatypes display and duplicate data check
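These checks might look like the following sketch:

```python
boston_df.info()               # non-null count and dtype of each of the 14 columns
boston_df.dtypes               # datatypes alone
boston_df.duplicated().sum()   # count of fully duplicated rows
```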

Descriptive statistics of the dataset can be obtained with the describe() command. It displays the mean, standard deviation, min and max values, the median and the other quartiles for each numeric column.

Display of Statistical information
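For instance:

```python
boston_df.describe()   # count, mean, std, min, 25%, 50% (median), 75%, max per column
```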

Dividing the aggregated dataset into input and output. In pandas and NumPy, axis=0 refers to rows, while axis=1 refers to columns.
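A sketch of that separation, again assuming the target column was named 'MEDV':

```python
X = boston_df.drop('MEDV', axis=1)   # axis=1: drop a column, keeping all rows
y = boston_df['MEDV']
```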

Splitting the dataset into train (85%) and test (15%). The random state helps to maintain the same split every time the code is run. If the random state is not given, the data split will not be consistent; it keeps changing, making it difficult to compare accuracy across executions.
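Roughly, with an arbitrary random_state value (the original value is not shown):

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)   # 85% train / 15% test, reproducible shuffle
```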

Using the inbuilt algorithm, we can build (fit) the model and predict the output for the test data. Observe that during training both the input and the actual output are given, so that the model learns and corrects itself by comparing the actual output with the predicted value.
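A minimal fit-and-predict sketch:

```python
model = LinearRegression()
model.fit(X_train, y_train)       # learn from inputs and their actual outputs
y_pred = model.predict(X_test)    # predict outputs for unseen test inputs
```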

Predicted output

The final step is to evaluate the accuracy of the model. The metrics package of scikit-learn comes in handy for assessing the model's fitness.

Model Evaluation
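A sketch of the evaluation step:

```python
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.2f}, r2_score: {r2:.2f}")
```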

r2_score is another way of expressing how well the ML model fits. It expresses how much of the variance in the target data has been explained by the model. If all the variance is explained, the actual values match the predicted ones, which makes SSres 0 and thus r2_score becomes 1 (the highest, most desirable score).

SSres = the residual sum of squares, i.e. the sum of squared differences between the actual and predicted values (dividing it by the number of samples gives the mean_squared_error)

SStot = the total sum of squares, i.e. the sum of squared deviations of the actual values from their mean (proportional to the variance of the target)

r2_score = 1 - (SSres / SStot)
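To make the formula concrete, here is a hand-rolled sketch that should match sklearn's r2_score for the same predictions:

```python
ss_res = np.sum((y_test - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)    # total sum of squares
r2_manual = 1 - ss_res / ss_tot                   # equals r2_score(y_test, y_pred)
```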

R2 depiction taken from wiki

AI Enthusiast | Blogger✍