last updated 9/22/20

Linear regression is a modeling technique that predicts an output value, *y*, as a linear combination of a set of independent numeric input variables, *x_{i}*:

*y = c + a_{1}x_{1} + a_{2}x_{2} + ... + a_{n}x_{n}*

The task of this supervised technique is first to determine which input variables are significant, and second to determine the constant **c** and the set of **a_{i}**'s.

When there is a single input variable x, the equation is just the slope-intercept form of a line.

*y = b + ax*

Given a set of x's and y's, the slope a and intercept b can be determined with the following least-squares calculations (derived with calculus):

**a = (n·Σ(x_{i}y_{i}) - Σ(x_{i})·Σ(y_{i})) / (n·Σ(x_{i}^{2}) - (Σ(x_{i}))^{2})**

**b = (Σ(y_{i}) - a·Σ(x_{i})) / n**
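As a quick check of these formulas, here is a tiny worked example (the data points are made up so that the line y = 2x + 1 fits exactly):

```python
# Toy data (made-up): generated from y = 2x + 1, so we expect a = 2, b = 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Slope from the least-squares formula
a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
# Intercept: average y minus slope times average x
b = (sum_y - a * sum_x) / n

print(a, b)  # 2.0 1.0
```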

With only one variable this is of limited use, but the process extends readily to multiple x's using matrix algebra methods. The fitted equation is then a hyperplane rather than a line, sitting in n-space, where n is the (number of independent variables)+1.
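The matrix-algebra version can be sketched with NumPy: prepend a column of ones to the inputs (for the constant c) and solve the least-squares problem. The data here is made up so the true coefficients are known:

```python
import numpy as np

# Toy data: two input variables (made-up numbers)
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
# Target generated as y = 1 + 2*x1 + 3*x2, so we know the true coefficients
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]

# Column of ones so the first solved coefficient is the constant c
A = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||A w - y||^2
w, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(w)  # approximately [1, 2, 3]
```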

Excel provides a function LINEST to apply this linear regression (least squares) method.

LINEST returns the coefficients in reverse order (last variable first), so in this example:

*temperature = 98.645 - 2.16*latitude + 0.114*longitude*

The higher the latitude, the colder the temperature! Longitude matters less, although western cities seem to be warmer in January.

The R-squared value (0.741) is what you use to compare how closely the model fits the data set; models with higher R-squared values fit better. In this case the model using latitude alone to predict temperature had an R-squared of 0.711, which isn't as good, so including longitude improves the model.
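R-squared itself is just 1 minus the residual sum of squares over the total sum of squares. A quick sketch (the actual/predicted values are made-up numbers for illustration):

```python
# Actual and predicted values (made-up numbers for illustration)
y_true = [30.0, 45.0, 50.0, 62.0]
y_pred = [32.0, 44.0, 51.0, 60.0]

mean_y = sum(y_true) / len(y_true)

# Residual sum of squares: how far the predictions miss the data
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Total sum of squares: spread of the data around its own mean
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))
```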

In Python/Jupyter/sklearn:

```python
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Create linear regression object
regr = linear_model.LinearRegression()

# Fit regression model to the training set
regr.fit(X_train, y_train)

# Apply model to the test set
y_pred_test = regr.predict(X_test)
```
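To make that snippet runnable end to end, here is a sketch with synthetic data (the data-generating equation and the train/test split sizes are made up; the variable names match the snippet):

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: y = 1 + 2*x1 - 3*x2 plus a little noise (made-up)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Hold out the last 20 rows as a test set
X_train, X_test = X[:80], X[80:]
y_train, y_test = y[:80], y[80:]

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred_test = regr.predict(X_test)

print("coefficients:", regr.coef_, "intercept:", regr.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred_test))
print("R^2:", r2_score(y_test, y_pred_test))
```

The recovered coefficients should land close to 2 and -3, with an R-squared near 1 since the noise is small.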

See tutorial5 from the text materials

Combining a decision tree with linear regression allows you to generate and apply different, more appropriate linear regression models based on certain criteria. See the general tree below: each leaf would be a different linear regression model.
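A minimal hand-rolled sketch of the idea (sklearn has no built-in model tree, and the split point and data here are made up): a single split on one attribute, with a separate linear model fit at each leaf:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data with two regimes: slope 1 below x=5, slope 4 from x=5 on
x = np.arange(10, dtype=float).reshape(-1, 1)
y = np.where(x[:, 0] < 5, 2 + 1 * x[:, 0], -10 + 4 * x[:, 0])

# A "tree" with one split: x < 5 goes to the left leaf, otherwise right
left, right = x[:, 0] < 5, x[:, 0] >= 5
model_left = LinearRegression().fit(x[left], y[left])
model_right = LinearRegression().fit(x[right], y[right])

def predict(xi):
    # Route the point down the tree, then use that leaf's regression
    model = model_left if xi < 5 else model_right
    return model.predict([[xi]])[0]

print(predict(2.0), predict(8.0))  # close to 4.0 and 22.0
```

A single global line would fit this data poorly; the two leaf models recover each regime exactly.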

Converting categorical attributes to 0 and 1 (Weka does this) then allows you to use linear regression on the resulting numeric attributes. We saw this in an earlier example:

*lifeInsurancePromo = 0.591(CreditCardIns) - 0.546(sex) + 0.773*

These coefficients are hard to interpret, however.
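The 0/1 conversion that Weka performs can be sketched in pandas; the column names and values below are made up for illustration, not taken from the actual data set:

```python
import pandas as pd

# Made-up data frame with two two-valued categorical attributes
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "CreditCardIns": ["no", "yes", "no", "yes"],
})

# Map each two-valued category to 0/1 so linear regression can use it
numeric = pd.DataFrame({
    "sex": (df["sex"] == "female").astype(int),
    "CreditCardIns": (df["CreditCardIns"] == "yes").astype(int),
})
print(numeric)
```

Attributes with more than two values would instead need one 0/1 column per value (one-hot encoding), e.g. via `pd.get_dummies`.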