Multiple linear regression (MLR) is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. A linear regression model that contains more than one predictor variable is called a multiple linear regression model. The goal of multiple linear regression (MLR) is to model the relationship between the explanatory and response variables.

The model for MLR, given n observations, is:

Let’s take an example:

The dataset has 5 columns which contains extract from the Profit and Loss statement of 50 start up companies. This tells about the companies R&D, Admin and Marketing spend, the state in which these companies are based and also profit that the companies realized in that year. A venture capitalist (VC) would be interested in such a data and would to see if factors like R&D Spend, Admin expenses, Marketing spend and State has any role to play on the profitability of a startup. This analysis would help VC to make investment decisions in future.

Profit is the dependent variable and other variables are independent variables.

**Dummy Variables**

Let’s look at the dataset we have for this example:

One challenge we would face while building the linear model is on handling the State variable. State column has a categorical value and can not be treated as like any other numeric value. We need to add dummy variables for each categorical value like below:

Add 3 columns for each categorical value of state. Add 1 to the column where row value of state matches to the column header. Row containing New York will have 1 against the column header New York and rest of the values in that column will be zero. Similarly, we need to modify California and Florida columns too. Three additional columns that we added are called dummy variables and these will be used in our model building. State column can be ignored. We can also ignore New York column from analysis because row which has zero under California and Florida implicitly implies New York will have a value of 1. We always use 1 less dummy variable compared to total factors to avoid dummy variable trap.

Python code:

import pandas as pdimport matplotlib.pyplot as pltimport numpy as np#Datasetdata_df = pd.read_csv(“https://raw.githubusercontent.com/swapnilsaurav/MachineLearning/master/3_Startups.csv”)#Getting X and Y valuesX = data_df.iloc[:, :-1].valuesy = data_df.iloc[:, -1].values #Encoding the categorical variables:from sklearn.preprocessing import LabelEncoder, OneHotEncoderlabelencoder_X = LabelEncoder() #Change the text into numbers 0,1,2 – 4th columnX[: ,3]= labelencoder_X.fit_transform(X[: ,3])#create dummy variablesfrom sklearn.compose import ColumnTransformertransformer = ColumnTransformer([( ‘one_hot_encoder’, OneHotEncoder(), [3])],remainder=‘passthrough’)#Now a little fit and transformX = np.array(transformer.fit_transform(X), dtype=np.float)#4 Avoid the dummy variables trapX= X[:, 1:]#Delete the first column represent the New York #Split into training and test setfrom sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0) #Train the Algorithmfrom sklearn.linear_model import LinearRegressionregressor = LinearRegression() regressor.fit(X_train, y_train) y_pred = regressor.predict(X_test) #The y_pred is a numpy array that contains all predicted valuesoutput_df = pd.DataFrame({#compare actual output values for X_test with predicted values ‘Actual’: y_test, ‘Predicted’: y_pred})print( “Actual v Predicted: \n“,output_df)####import numpy as npfrom sklearn import metricsexplained_variance=metrics.explained_variance_score(y_test, y_pred) mean_absolute_error=metrics.mean_absolute_error(y_test, y_pred) mse=metrics.mean_squared_error(y_test, y_pred) mean_squared_log_error=metrics.mean_squared_log_error(y_test, y_pred) median_absolute_error=metrics.median_absolute_error(y_test, y_pred) r2=metrics.r2_score(y_test, y_pred) print( ‘Explained_variance: ‘, round(explained_variance,2))print( ‘Mean_Squared_Log_Error: ‘, round(mean_squared_log_error,2))print( ‘R-squared: ‘, round(r2,4))print( ‘Mean Absolute Error(MAE): ‘, round(mean_absolute_error,2))print( ‘Mean Squared Error (MSE): ‘, round(mse,2))print( ‘Root Mean Squared Error (RMSE): ‘, round(np.sqrt(mse),2))from statsmodels.api import OLSimport statsmodels.api as sm#In our model, y will be dependent on 2 values: coefficienctX = sm.add_constant(X)# and constant, so we need to add additional column in X for #constant value summ = OLS(y, X).fit().summary() print( “Summary of the dataset: \n“,summ) |

*Output:*

In above table, x1 and x2 are the dummy variables for state, x3 is R&D, x4 is Administration, x5 is the marketing spends.

*How many independent variables to consider?*

We need to be careful to choose which ones we need to keep for input variables. We do not want to include all the variables for mainly 2 reasons:

- GIGO: If we feed garbage to our model we will get garbage out so we need to feed in right set of data
- Justifying the input: Can we justify the inclusion of all the data, if no, then we should not include them.

There are 4 methods to build a multiple linear model:

- Select all in
- Backward Elimination
- Forward Selection
- Bidirectional Elimination

** Select-all-in**: We select all the independent variables because we know that all variables impact the result or you have to because business leaders want you to include them.

**Backward Elimination**:

- Select a significance level to stay in the model (e..g. SL =0.05, higher P value to be removed)
- Fit the full model with all possible predictors.
- Consider the predictor with the highest P-value. If P>SL, go to step 4 otherwise goto 5
- Remove the predictor and refit the model and Go to step 3
- Your model is ready!

**Forward Selection**:

- Select a significance level to stay in the model (e..g. SL =0.05, lower P value to be kept)
- Fit all the simple regression models, Select the one with the lowest P-value.
- Keep this variable and fit all possible models with one extra predictor added to the ones you already have. Now Run with 2 variable linear regressions.
- Consider the predictor with the lowest P-value. If P<SL, go to Step 3, otherwise go to next step.
- Keep the previous model!

**Bi-directional Selection**: It is a combination of Forward selection and backward elimination:

- Select a significant level to enter and stay in the model (SLE = SLS = 0.05)
- Perform the next step of Forward selection (new variables must have P<SLE)
- Perform all the step of Backward elimination (old variables must have P<SLS)
- Iterate between 2 & 3 till no new variables can enter and no old variables can exit.

In the multiple regression example since we have already executed with all the attributes, let’s implement backward elimination method here and remoe out the attributes that are not useful for us. Let’ have a relook at the stats summary:

Look at the highest p-values and remove it. In this condition x2 (second dummy variable has the highest one (0,990). Now, we will remove this variable from the X and re-run the model.

X_opt= X[:, [0,1,3,4,5]] regressor_OLS=sm.OLS(endog = y, exog = X_opt).fit() summ =regressor_OLS.summary() print( “Summary of the dataset after elimination 1: \n“,summ) |

*Output Snapshot:*

Look at the highest p-value again. #First dummy variable, x1’s p-value is 0.940. Remove this one. Even though this appeared as high number in the previous step also, but as per the algorithm we need to remove only 1 value at a time. Since, removing an attribute can have impact on other attributes also. Re-run the code again:

X_opt= X[:, [0,3,4,5]] regressor_OLS=sm.OLS(endog = y, exog = X_opt).fit() summ = regressor_OLS.summary() print( “Summary of the dataset after elimination 2: \n“,summ) |

Admin spends (x2) has the highest p-value (0.602). Remove this as well.

X_opt= X[:, [0,3,5]] regressor_OLS=sm.OLS(endog = y, exog = X_opt).fit() summ = regressor_OLS.summary() print( “Summary of the dataset after elimination 3: \n“,summ) |

Admin spends (x2) has the highest p-value (0.06). This value is low but since we have selected the significance level (SL) as 0.05, we need to remove this as well.

X_opt= X[:, [0,3]] regressor_OLS=sm.OLS(endog = y, exog = X_opt).fit() summ =regressor_OLS.summary() print( “Summary of the dataset after elimination 3: \n“,summ) |

Finally, we see that only one factor has the significant impact on the profit. The highest impact variable is R&D spendings on profit of these startups. The accuracy of the model has also increased. When we included all the attributes, the R squared value was 0.9347 and now its at 0.947.

The word “linear” in “multiple linear regression” refers to the fact that the model meets all the criteria discussed in the next section.

## Test for Linearity

The next question we need to understand is when can we perform or not perform Linear Regression. In this section, let’s understand the assumptions of linear regression in detail. One of the most essential steps to take before applying linear regression and depending solely on accuracy scores is to check for these assumptions and only when a dataset meet these assumptions, we say that dataset can be used for linear regression model.

For the analysis, we will take the same dataset, we used for Multiple Linear Regression Analysis in the previous section.

import pandas as pd#Datasetdata_df = pd.read_csv(“https://raw.githubusercontent.com/swapnilsaurav/MachineLearning/master/3_Startups.csv”) |

Before we apply regression on all these attributes, we need to understand if we need to really take all of these attributes into consideration. There are two things we need to consider:

First step is to test the dataset if its fits into the linearity definition, which we will perform different tests in this section. Remember, we only test for numerical columns as the categorical columns are not taken into account. As we know that the categorical values are converted into dummy variables of values 0 and 1, dummy variables meet the assumption of linearity by definition, because they creat two data points, and two points define a straight line. There is no such thing as a non-linear relationship for a single variable with only two values.

Code for Prediction: Let’s rewrite the code

import numpy as npX = data_df.iloc[:,:-1].values y = data_df.iloc[:,-1].values#handling categorical datafrom sklearn.preprocessing import LabelEncoder, OneHotEncoderle_x = LabelEncoder() X[:,3] = le_x.fit_transform(X[:,3]) from sklearn.compose import ColumnTransformertranformer = ColumnTransformer([( ‘one_hot_encoder’, OneHotEncoder(),[3])], remainder=‘passthrough’)X = np.array(tranformer.fit_transform(X), dtype=np.float) X=X[:,1:] from sklearn.model_selection import train_test_splitX_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=0) from sklearn.linear_model import LinearRegressionregressor = LinearRegression() regressor.fit(X_train , y_train) y_pred = regressor.predict(X_test) output_df = pd.DataFrame({ “Actual”:y_test, “Predicted”: y_pred}) |

Let’s perform the tests now:

**1. Linearity**

Linear regression needs the relationship between the independent and dependent variables to be linear. Let’s use a pair plot to check the relation of independent variables with the profit variable.

*Output:*

Python Code:

import seaborn as snsimport matplotlib.pyplot as plt# visualize the relationship between the features and the response using scatterplotsp = sns.pairplot(data_df, x_vars=[‘R&D Spend’,‘Administration’,‘Marketing Spend’], y_vars=‘Profit’, height=5, aspect=0.7)plt.show() |

By looking at the plots we can see that with the R&D Spend form an accurately linear shape and Marketing Spend is somewhat in the linear shape but Administration Spend is all over the graph but still shows increasing trend as Profit value increases on Y-Axis. Here we can use Linear Regression models.

**2. Variables follow a Normal Distribution**

The variables (X) follow a normal distribution. In order words, we want to make sure that for each x value, y is a random variable following a normal distribution and its mean lies on the regression line. One of the ways to visually test for this assumption is through the use of the Q-Q-Plot. Q-Q stands for Quantile-Quantile plot and is a technique to compare two probability distributions in a visual manner. To generate this Q-Q plot we will be using scipy’s probplot function where we compare a variable of our chosen to a normal probability.

import scipy.stats as statsstats.probplot(X[:,3], dist= “norm”, plot=plt)plt.show() |

The points must lie on this red line to conclude that it follows a normal distribution. In this case of selecting 3rd column which is R&D Spend, yes it does! A couple of points outside of the line is due to our small sample size. In practice, you decide how strict you want to be as it is a visual test.

**3. There is no or little** **multicollinearity**

Multicollinearity means that the independent variables are highly correlated with each other. X’s are called independent variables for a reason. If multicollinearity exists between them, they are no longer independent and this generates issues when modeling linear regressions.

To visually test for multicollinearity we can use the power of Pandas. We will use Pandas corr function to compute the pairwise correlation of our columns. If you find any values in which the absolute value of their correlation is >=0.8, the multicollinearity assumption is being broken.

#convert to a pandas dataframeimport pandas as pddf = pd.DataFrame(X) df.columns = [ ‘x1’,‘x2’,‘x3’,‘x4’,‘x5’]#generate correlation matrixcorr = df.corr() #Plot HeatMapp=sns.heatmap(df.corr(), annot=True,cmap=‘RdYlGn’,square=True)print( “Corelation Matrix:\n“,corr) |

**4. Check for Homoscedasticity: **The data are needs to be homoscedastic (meaning the residuals are equal across the regression line). Homoscedasticity means that the residuals have equal or almost equal variance across the regression line. By plotting the error terms with predicted terms we can check that there should not be any pattern in the error terms.

#produce regression plotsfrom statsmodels.api import OLSimport statsmodels.api as smX = sm.add_constant(X) model = OLS(y, X).fit() summ = model.summary() print( “Summary of the dataset: \n“,summ)fig = plt.figure(figsize=(12,8)) #Checking for x3 (R&D Spend)fig = sm.graphics.plot_regress_exog(model, ‘x3’, fig=fig)plt.show() |

Four plots are produced. The one in the top right corner is the residual vs. fitted plot. The x-axis on this plot shows the actual values for the predictor variable points and the y-axis shows the residual for that value. Since the residuals appear to be randomly scattered around zero, this is an indication that heteroscedasticity is not a problem with the predictor variable x3 (R&D Spend). Multiple Regression, we need to create this plot for each of the predictor variable.

**5. Mean of Residuals**

Residuals as we know are the differences between the true value and the predicted value. One of the assumptions of linear regression is that the mean of the residuals should be zero. So let’s find out.

residuals = y_test-y_pred mean_residuals = np.mean(residuals) print( “Mean of Residuals {}”.format(mean_residuals)) |

Output:

Mean of Residuals 3952.010244810798

**6. Check for Normality of error terms/residuals**

p = sns.distplot(residuals,kde=True)p = plt.title( ‘Normality of error terms/residuals’)plt.show() |

The residual terms are pretty much normally distributed for the number of test points we took.