Linear Regression

last updated 9/22/20

Introduction

Linear regression is a modeling technique that predicts an output value, y, as a linear combination of a set of independent numeric input variables, xi:

y = c + a1*x1 + a2*x2 + ... + an*xn

The task of this supervised technique is first to determine which input variables are significant, and second to determine the constant c and the coefficients ai.
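As a minimal sketch of how such an equation is applied once the constants are known, the function below evaluates the linear combination for one set of inputs. The function name and the sample constants are made up for illustration and do not come from any fitted model.

```python
def predict(c, coefficients, inputs):
    """Return y = c + a1*x1 + a2*x2 + ... + an*xn."""
    if len(coefficients) != len(inputs):
        raise ValueError("need one coefficient per input variable")
    return c + sum(a * x for a, x in zip(coefficients, inputs))

# Made-up constants, for illustration only
y = predict(1.0, [2.0, -0.5], [3.0, 4.0])
print(y)  # 1.0 + 2.0*3.0 - 0.5*4.0 = 5.0
```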

When there is a single input variable x, the equation is just a slope-intercept form of a line.

y = b + ax

Given a set of n (x, y) pairs, the constants a and b can be determined with the following least-squares calculations (derived in calculus), where mx = Σ(xi)/n and my = Σ(yi)/n are the means:

a = Σ((xi - mx)*(yi - my)) / Σ((xi - mx)^2)

b = Σ(yi)/n - a * Σ(xi)/n
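The single-variable case can be sketched in a few lines of plain Python. This sketch uses the standard least-squares formulas, slope = Σ((xi - mean_x)*(yi - mean_y)) / Σ((xi - mean_x)^2) and intercept = mean_y - slope*mean_x. The function name and the sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = b + a*x; returns (b, a)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    a = sxy / sxx            # slope
    b = y_mean - a * x_mean  # intercept
    return b, a

# Made-up points lying exactly on y = 1 + 2x
b, a = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(b, a)  # 1.0 2.0
```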

A single input variable is of limited use, but the process extends easily to multiple x's using matrix algebra methods. The equation then describes a flat surface (a hyperplane) rather than a line, in n-space, where n is the (number of independent variables)+1.
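One way to sketch the matrix version is with NumPy's least-squares solver. The data below is synthetic, generated exactly from y = 3 + 2*x1 - 1*x2, so the fit should recover those constants. The variable names are illustrative, and using `np.linalg.lstsq` is an assumption about tooling rather than the method these notes prescribe.

```python
import numpy as np

# Made-up data generated exactly from y = 3 + 2*x1 - 1*x2
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so the first fitted value is the constant c
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of A @ coeffs = y
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
c, a1, a2 = coeffs
print(np.round(coeffs, 6))
```

The column of ones is what lets the constant c be estimated alongside the ai's in a single solve.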

Excel provides a function LINEST to apply this linear regression (least squares) method.

[Figure: LINEST function in Excel]

LINEST returns the coefficients in reverse order, with the constant last, so read them from right to left. In this example:

temperature = 98.645 - 2.16*latitude + 0.114*longitude

The higher the latitude, the colder the temperature! Longitude matters less, although western cities seem to be warmer in January.
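The reverse reading can be sketched as follows. The three numbers are the ones from the equation above. The assumption that the input columns were given as latitude then longitude, and the sample city, are illustrative only.

```python
# Coefficient row in the order LINEST reports it: the coefficient for the
# last input column first, the constant last.
linest_row = [0.114, -2.16, 98.645]
input_names = ["latitude", "longitude"]  # assumed order of the x columns

# Split off the constant, then reverse the rest to match the input order
*slopes_reversed, intercept = linest_row
slopes = dict(zip(input_names, reversed(slopes_reversed)))

def predict_temperature(latitude, longitude):
    return (intercept
            + slopes["latitude"] * latitude
            + slopes["longitude"] * longitude)

# Hypothetical city at latitude 40, longitude 90
print(round(predict_temperature(40, 90), 3))  # 22.505
```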

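The R-squared value (0.741) measures how closely the model fits the data set; values closer to 1 indicate a closer fit. In this case, the model that used only latitude to predict temperature had an R-squared of 0.711, so including longitude gives a somewhat better fit. Keep in mind that R-squared computed on the fitting data never decreases when an input variable is added, so a small gain alone does not show that the extra variable is significant.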
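For reference, R-squared is commonly computed as 1 - (sum of squared residuals) / (total sum of squares around the mean). The sketch below applies that definition to two sets of made-up predictions; none of these numbers come from the temperature data.

```python
def r_squared(y_actual, y_predicted):
    """R-squared = 1 - SS_residual / SS_total."""
    y_mean = sum(y_actual) / len(y_actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_actual, y_predicted))
    ss_tot = sum((y - y_mean) ** 2 for y in y_actual)
    if ss_tot == 0:
        raise ValueError("actual values are all identical")
    return 1 - ss_res / ss_tot

# Made-up values, for illustration only
actual = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 18.0, 33.0, 37.0]  # closer predictions
model_b = [20.0, 20.0, 30.0, 30.0]  # coarser predictions
print(round(r_squared(actual, model_a), 3))  # 0.948
print(round(r_squared(actual, model_b), 3))  # 0.6
```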

In Python/Jupyter/sklearn

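from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# X_train, X_test, y_train, y_test are assumed to exist already,
# e.g. from sklearn.model_selection.train_test_split(X, y)

# Create linear regression object
regr = linear_model.LinearRegression()

# Fit regression model to the training set
regr.fit(X_train, y_train)

# Apply model to the test set
y_pred_test = regr.predict(X_test)

# Inspect the fitted constants and evaluate the test-set predictions
print("Intercept:", regr.intercept_, "Coefficients:", regr.coef_)
print("MSE:", mean_squared_error(y_test, y_pred_test))
print("R-squared:", r2_score(y_test, y_pred_test))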

See tutorial5 from the text materials
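For readers without the tutorial data at hand, here is a self-contained sketch of the same steps on synthetic data. The data is noise-free and generated from y = 3 + 2*x1 - 1*x2 purely for illustration. The 70/30 split and the random seeds are arbitrary choices, not values from the tutorial.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data from y = 3 + 2*x1 - 1*x2 (illustration only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Fit on the training set, then predict the held-out rows
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred_test = regr.predict(X_test)

print("intercept:", regr.intercept_)
print("coefficients:", regr.coef_)
print("MSE:", mean_squared_error(y_test, y_pred_test))
print("R-squared:", r2_score(y_test, y_pred_test))
```

Because the data contains no noise, the fitted constants should match the generating equation almost exactly; real data would not behave this neatly.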

 


Regression Trees

Combining a decision tree with linear regression lets you build and apply different, more appropriate linear regression models depending on certain criteria. See the general tree below. Each leaf would hold a different linear regression model.

[Figure: Regression Tree example]
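The idea of a tree whose leaves are linear models can be sketched with a small nested structure. Everything here is hypothetical: the split variable, the threshold, and both leaf equations are invented for illustration and are not taken from the figure.

```python
# Hypothetical one-split tree: internal nodes test a variable against a
# threshold, leaves hold the constants of a linear regression model.
tree = {
    "feature": "x1",
    "threshold": 5.0,
    "left": {"intercept": 2.0, "coefs": {"x2": 1.5}},     # used when x1 < 5
    "right": {"intercept": 20.0, "coefs": {"x2": -0.5}},  # used when x1 >= 5
}

def predict(node, row):
    """Walk down to a leaf, then apply that leaf's linear equation."""
    while "feature" in node:
        branch = "left" if row[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["intercept"] + sum(a * row[name] for name, a in node["coefs"].items())

print(predict(tree, {"x1": 3.0, "x2": 4.0}))  # left leaf:  2.0 + 1.5*4.0 = 8.0
print(predict(tree, {"x1": 7.0, "x2": 4.0}))  # right leaf: 20.0 - 0.5*4.0 = 18.0
```

The same x2 value produces different predictions depending on which leaf the row lands in, which is the point of pairing the tree with separate regression models.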

 


Logistic Regression

Converting categorical attributes to 0 and 1 (Weka does this) allows you to use linear regression on the resulting numeric attributes. We saw this in an earlier example:

lifeInsurancePromo = 0.591*(CreditCardIns) - 0.546*(sex) + 0.773

These coefficients are hard to interpret, however: the equation's output is not confined to the range 0 to 1, so it cannot be read directly as a probability.
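Evaluating the equation for every 0/1 combination shows what its outputs look like. The function name is made up, and which category each 0/1 value stands for is not stated here, so the inputs are left as bare numbers.

```python
def life_insurance_promo(credit_card_ins, sex):
    """Equation from the text; both inputs are expected to be 0 or 1."""
    return 0.591 * credit_card_ins - 0.546 * sex + 0.773

for credit_card_ins in (0, 1):
    for sex in (0, 1):
        value = life_insurance_promo(credit_card_ins, sex)
        print(credit_card_ins, sex, round(value, 3))
# 0 0 0.773
# 0 1 0.227
# 1 0 1.364
# 1 1 0.818
```

One of the four outputs exceeds 1, which may be part of why the result is awkward to read as a yes/no likelihood.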