Supervised Machine Learning: Linear Regression#

Introduction #

Main Idea:

  • the output (dependent) variable is continuous and we want to “predict” its value within the range of the input features. (WARNING: predicting outside that range, i.e. extrapolating, could lead to flawed inferences).

  • there is “noise”, which means that for essentially the same input values there may be slightly different values of the output variable, or there is “noise” in the measurement of all the variables.

  • we assume that the noise (i.e. the errors in measurement) follows a normal distribution with mean 0 and some unknown standard deviation.

Seminal Work: The linear correlation coefficient between two variables was introduced in Pearson, Karl (20 June 1895). “Notes on regression and inheritance in the case of two parents”. Proceedings of the Royal Society of London. 58: 240–242.

\[\large r: = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\]

Here \(\bar{x}\) is the mean of \(x\), \(\bar{y}\) is the mean of \(y\), \(s_x\) is the standard deviation of \(x\), and \(s_y\) is the standard deviation of \(y.\)
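
As a quick illustration (a minimal sketch with made-up numbers, not part of the original notes), the formula above can be checked against scipy.stats.pearsonr:

import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical input values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical output values

n = len(x)
sx, sy = x.std(ddof=1), y.std(ddof=1)     # sample standard deviations
r_manual = np.sum((x - x.mean()) / sx * (y - y.mean()) / sy) / (n - 1)

r_scipy, pval = pearsonr(x, y)
print(r_manual, r_scipy)                  # the two values agree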

Visual Intuition:

The Linear Correlation Concept

Test statistic for the correlation coefficient:

\[\large t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\]

where \(n\) is the number of observations and \(r\) represents the correlation coefficient computed from the data.
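
A minimal sketch (hypothetical values of \(r\) and \(n\)) of how this statistic and its two-sided p-value could be computed with scipy:

import numpy as np
from scipy import stats

r, n = 0.87, 32                            # hypothetical correlation and sample size
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
pval = 2 * stats.t.sf(abs(t), df=n - 2)    # two-sided p-value, n-2 degrees of freedom
print(t, pval)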

The slope of the regression line is

\[\large m = r\cdot\frac{s_y}{s_x}\]

Theoretical Perspective:

  • we want to estimate the expected value of the dependent variable as a function of the input features; that is, we want to approximate the conditional expectation \(\mathbb{E}(Y|\text{input features})\) by a function of the input features:

\[\large\mathbb{E}(Y|X=x) = f(x)\]
  • we want to determine the simplest form of the function \(f\) (principle of parsimony) and we assume that

\[\large Y = f(X) + \sigma \epsilon\]

where \(\epsilon\) is the “noise”: in statistical terms, \(\epsilon\) is independent and identically distributed and follows a standard normal distribution, and \(\sigma>0\) is generally unknown.
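
As a small illustration (a simulation sketch, not part of the original notes; the choice \(f(x)=1+3x\) and \(\sigma=2\) is purely hypothetical), this is what the assumed data-generating mechanism looks like in code:

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                          # unknown in practice; fixed here only to simulate
x = rng.uniform(0, 10, size=100)     # input feature
eps = rng.standard_normal(100)       # iid standard normal noise
y = 1 + 3 * x + sigma * eps          # observed output: Y = f(X) + sigma*epsilon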

The Coefficient of Determination

\[\large R^2:=1-\frac{\sum (residual_i)^2}{\sum(y_i-\bar{y})^2}\]
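
A minimal sketch (made-up numbers) computing \(R^2\) from this definition and checking it against sklearn.metrics.r2_score:

import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.0, 5.0, 7.0, 9.0])         # hypothetical observed values
y_hat = np.array([2.8, 5.3, 6.9, 9.2])     # hypothetical predictions

residuals = y - y_hat
r2_manual = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)
print(r2_manual, r2_score(y, y_hat))       # the two values agree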

Multiple Linear Regression (Linear models with more features)#

Important: The matrix-vector product is a linear combination of the columns of the matrix:

\[\begin{split}\large X\beta =\beta_1\begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix} + \beta_2\begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix} + \cdots + \beta_p\begin{bmatrix} x_{1p} \\ x_{2p} \\ \vdots \\ x_{np} \end{bmatrix} \end{split}\]

where

\[\begin{split}\large X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \end{split}\]
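
A minimal numpy sketch (made-up numbers) of this fact, checking that \(X\beta\) equals the linear combination of the columns of \(X\):

import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [1.0, 0.0, 2.0]])             # n = 4 observations, p = 3 features
beta = np.array([0.5, -1.0, 2.0])

as_product = X @ beta
as_columns = sum(beta[j] * X[:, j] for j in range(X.shape[1]))
print(np.allclose(as_product, as_columns))  # True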

Vector Spaces #

| Axiom | Meaning |
|---|---|
| Associativity of vector addition | u + (v + w) = (u + v) + w |
| Commutativity of vector addition | u + v = v + u |
| Identity element of vector addition | There exists an element 0 ∈ V, called the zero vector, such that v + 0 = v for all v ∈ V. |
| Inverse elements of vector addition | For every v ∈ V, there exists an element −v ∈ V, called the additive inverse of v, such that v + (−v) = 0. |
| Compatibility of scalar multiplication with field multiplication | a(bv) = (ab)v |
| Identity element of scalar multiplication | 1v = v, where 1 denotes the multiplicative identity in F. |
| Distributivity of scalar multiplication with respect to vector addition | a(u + v) = au + av |
| Distributivity of scalar multiplication with respect to field addition | (a + b)v = av + bv |

An example of a linear model with two features is \(\hat{y}_i = 1+3x_{i1}+5x_{i2}.\)

In this example the value \(1\) is referred to as the intercept.

If there are \(p\) features in the data and we want to create a linear model, the input-output mechanism is

\[ \underbrace{Y}_\text{Output} = \underbrace{\beta_1 X_1+\beta_2 X_2+...+\beta_p X_p}_\text{Linear combination of features} \]

This can be represented as a matrix-vector product:

\[ \underbrace{Y}_\text{Output} = \underbrace{X\cdot \beta}_\text{Linear combination of the columns of matrix X} \]

In this model the features are \(X_1, X_2, ...X_p\) and \(\beta_1, \beta_2,...\beta_p\) are a set of weights (real numbers).

The assumption for multiple linear regression is

\[\large Y = X\beta + \sigma \epsilon \]

where \(\sigma\) is the standard deviation of the noise. Further, we assume that the “noise” \(\epsilon\) is independent and identically distributed with a zero mean.

We believe that the output is a linear combination of the input features.

Thus, if we would like to solve for the “weights” \(\beta\) we may consider

\[\large X^tY = X^tX\beta+\sigma X^t\epsilon \]

And if the matrix \(X^tX\) is invertible then, since the noise has zero mean, we can solve for the expected value of \(\beta\):

\[\large \mathbb{E}(\beta) = (X^tX)^{-1}X^t Y \]

Using linear algebra, one can show that this is equivalent to the OLS solution obtained from minimizing the sum of the squared residuals.
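
A minimal sketch (simulated data, illustrative names) of this equivalence: solving the normal equations directly and comparing with sklearn's LinearRegression (with fit_intercept=False so both solve the same problem):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                     # hypothetical feature matrix
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # solve the normal equations
ols = LinearRegression(fit_intercept=False).fit(X, y)
print(beta_hat, ols.coef_)                       # the two estimates agree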

Linear vs Non-linear models#

This is a linear model in terms of the weights \(\beta\):

\[\large \hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 - \beta_3x_3 \]

An example of what linear in the weights means:

\[\large \hat{y}(2\beta+3\alpha) = 2\hat{y}(\beta)+3\hat{y}(\alpha) \]

The following is a non-linear model in terms of the coefficients (weights):

\[\large \hat{y} = \beta_0 + \beta_1^3x_1 + \frac{1}{\beta_2+\beta_3}x_2 - e^{\beta_3}x_3 \]
\[\large \hat{y}(2\beta+3\alpha) \neq 2\hat{y}(\beta)+3\hat{y}(\alpha) \]
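
A minimal numerical check (made-up feature values and weights) of the two statements above: the first prediction rule is linear in the weights, the second is not:

import numpy as np

x = np.array([1.0, 2.0, 3.0])                    # fixed feature values x1, x2, x3

def yhat_linear(b):                              # b = (b0, b1, b2, b3)
    return b[0] + b[1]*x[0] + b[2]*x[1] - b[3]*x[2]

def yhat_nonlinear(b):
    return b[0] + b[1]**3 * x[0] + x[1] / (b[2] + b[3]) - np.exp(b[3]) * x[2]

beta = np.array([1.0, 2.0, 3.0, 4.0])
alpha = np.array([0.5, -1.0, 2.0, 1.0])

print(np.isclose(yhat_linear(2*beta + 3*alpha),
                 2*yhat_linear(beta) + 3*yhat_linear(alpha)))        # True
print(np.isclose(yhat_nonlinear(2*beta + 3*alpha),
                 2*yhat_nonlinear(beta) + 3*yhat_nonlinear(alpha)))  # False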

The main point of linear regression is to assume that predictions can be made by using a linear combination of the features.

For example, the data from feature \(j\) is a column vector

\[\begin{split}\begin{bmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{bmatrix}\end{split}\]

and we assume that the dependent variable is predicted by a linear combination of these columns populated with the features’ data. Each column represents a feature and each row an independent observation.

The predicted value is denoted by \(\hat{y}\) and

\[\begin{split} \hat{y} = \beta_1\begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix} + \beta_2\begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix} + \cdots + \beta_p\begin{bmatrix} x_{1p} \\ x_{2p} \\ \vdots \\ x_{np} \end{bmatrix} \end{split}\]

We have a vector of weights:#

\[\begin{split} \beta = \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{p} \end{bmatrix} \end{split}\]


Critical thinking: what exactly is \(\hat{y}\)?

The matrix-vector product between the features and the weights:

\[ \hat{y} = X\cdot\beta \]

The main idea is that

\[\begin{split} \hat{y}= \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{n} \end{bmatrix} \end{split}\]

represents the predictions we make by training (or, as we say in ML, learning) the weights \(\beta.\)

Training means running an optimization algorithm and determining the values of the weights that minimize an objective function.

We want to learn the weights \(\beta_1,\beta_2,...\beta_p\) that minimize the sum of the squared residuals:#

\[\large \sum\limits_{i=1}^{n}\left(y_i-\sum\limits_{j=1}^{p}X_{i,j}\cdot\beta_j\right)^2 = \sum\limits_{i=1}^{n}\left(y_i-X_{i,1}\beta_1-X_{i,2}\beta_2 - \cdots - X_{i,p}\beta_p\right)^2 \]

How do we know we are on the right track after we perform the minimization of the squared residuals? #

Ordinary Least Squares Regression (OLS)#

First, we assume the simplest case: the data has only one continuous input feature.

The main idea of linear regression is that the expected value of the output is a linear function of the input variable(s).

\[\mathbb{E}(Y|X=x)\approx m\cdot x + b\]

To determine the line of best fit the goal is to minimize the sum of squared residuals:

\[ \min_{m,b} \sum\limits_{i=1}^{n}(y_i-mx_i-b)^2 \]

So the sum of the squared residuals is

\[ \sum\limits_{i=1}^{n}(y_i-mx_i-b)^2 \]

If \(N\) represents the number of observations (the number of rows in the data) then the cost function may be defined as

\[ L(m,b) = \frac{1}{N} \sum\limits_{i=1}^{N}(y_i-mx_i-b)^2 \]

where

\[\hat{y_i} = m\cdot x_i +b.\]

Given the predictions \(\hat{y}_i\), the Mean Squared Error is

\[ \frac{1}{N} \sum\limits_{i=1}^{N}(y_i-\hat{y}_i)^2 \]

Critical Thinking: at the optimal values \((m,b)\) the partial derivatives of the cost function \(L\) are equal to 0.

The gradient descent algorithm is based on this idea.
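
For completeness, here is the standard calculation (not spelled out in the original notes). Setting both partial derivatives of \(L\) to zero,

\[\frac{\partial L}{\partial m} = -\frac{2}{N}\sum_{i=1}^{N} x_i\,(y_i - m x_i - b)=0, \qquad \frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N} (y_i - m x_i - b)=0,\]

and solving the resulting system gives the closed-form OLS estimates

\[m = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{N}(x_i-\bar{x})^2}, \qquad b = \bar{y} - m\,\bar{x}.\]

The same two partial derivatives reappear below in the gradient descent code (the step_gradient function).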

Thus, the equation of the best-fit line is \(y = mx + b.\)

CRITICAL THINKING: How exactly are we obtaining the slope and the intercept?

ANSWER: One way to obtain the slope and the intercept is by applying the Ordinary Least Squares method.

We determine the values of \(m\) and \(b\) such that the sum of the squared vertical distances between the points and the line is minimal.

Source: Simple Linear Regression (Tobias Roeschl)

Diagnostics for Regression#

The Coefficient of Determination#

\[\large R^2:=1-\frac{\sum (residual_i)^2}{\sum(y_i-\bar{y})^2}\]

We know we have done a good job when \(R^2\) is very close to 1. We have done a very poor job if \(R^2\) is close to 0 or even negative.

Tests for Normality#

We believe that, if the residuals are normally distributed, then the average of the errors is a meaningful estimator for the model’s performance.

What is a Normality Test?#

Assume we have a univariate set of values (i.e., a sample from a univariate distribution). We check, based on some calculations with the sample data, whether that univariate distribution is a normal one.

Main Idea: We measure how well the empirical distribution of the data agrees with the theoretical distribution of a standard normal.

We want to recast this matching procedure as a linear correlation problem; that is, we compare the two cumulative distribution functions and check whether the empirical percentiles correlate linearly with the theoretical percentiles of a standard normal distribution.
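
A minimal sketch (simulated data, illustrative only) of this idea using scipy.stats.probplot, which also returns the correlation between the ordered sample and the theoretical normal quantiles:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=100)     # hypothetical univariate data

(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(r)   # close to 1 when the sample is compatible with a normal distribution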

The Kolmogorov-Smirnov Test#

The Kolmogorov-Smirnov test uses the concept of cumulative distribution functions:

\[CDF(x):= P(X<x) = \int_{-\infty}^{x}f(t)dt\]

The concept is useful in applications where we want to check if a random variable follows a certain distribution.

IMPORTANT: In most cases we standardize the values of the random variable, e.g. we compute z-scores.

The test is defined as:

H0 (the null hypothesis): The data follow a specified distribution.

H1 (the alternative hypothesis): The data do not follow the specified distribution

The main idea is that we focus on how much the empirical cumulative distribution function differs from the theoretical cumulative distribution function, and we may consider:

\[\sup_{x} |ECDF(x) - CDF(x)|\]

where \(ECDF(x)\) means the empirical cumulative distribution function:

\[ECDF(x):= \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}(x_i < x)\]

and \(CDF\) stands for the cumulative distribution function:

\[CDF(x):= \int_{-\infty}^{x}f(t)dt.\]

Here \(f\) is the probability density function.

If we order the observations, such that \(x_i\leq x_j\) when \(i\leq j\), then the test statistic is formally defined by:

\[D:=\max_{1\leq i\leq n}\left\{CDF(x_i)-\frac{i-1}{n},\ \frac{i}{n}-CDF(x_i)\right\}\]

In the \(ECDF\) formula above, the indicator adds \(1\) for each observation less than \(x\), and \(n\) represents the sample size.

If the p-value is high (much greater than 5%) we do not reject the null hypothesis, which means that the normality assumption is not violated.
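
A minimal sketch (simulated, already standardized data) computing the statistic \(D\) directly from its definition and comparing with scipy.stats.kstest:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.standard_normal(50)           # hypothetical z-scores
x_sorted = np.sort(sample)
n = len(x_sorted)
i = np.arange(1, n + 1)
cdf = stats.norm.cdf(x_sorted)             # theoretical CDF at the ordered points

D_manual = np.max(np.maximum(cdf - (i - 1) / n, i / n - cdf))
result = stats.kstest(sample, "norm")
print(D_manual, result.statistic, result.pvalue)   # the two statistics agree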

The Anderson-Darling Test#

The test is defined as:

H0 (the null hypothesis): The data follow a specified distribution.

H1 (the alternative hypothesis): The data do not follow the specified distribution

The test statistic is defined as:

\[\large AD := -n - \sum_{i=1}^{n} \frac{2i-1}{n}\left[\ln(CDF(x_i))+\ln(1-CDF(x_{n+1-i}))\right] \]

The critical values for the Anderson-Darling test are dependent on the specific distribution that is being tested.
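
A minimal sketch (simulated data) of the statistic above, with the sample standardized by its mean and sample standard deviation before applying the formula, checked against scipy.stats.anderson (which, to our understanding, standardizes the same way for the normal case):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=40)               # hypothetical data
z = np.sort((sample - sample.mean()) / sample.std(ddof=1))   # ordered z-scores
n = len(z)
i = np.arange(1, n + 1)
cdf = stats.norm.cdf(z)

AD_manual = -n - np.sum((2 * i - 1) / n * (np.log(cdf) + np.log(1 - cdf[::-1])))
result = stats.anderson(sample, dist='norm')
print(AD_manual, result.statistic)         # the two values agree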

Error Metrics#

MSE#

\[\text{MSE}:=\frac{1}{n}\sum_{i=1}^{n}(y_i-x_i\cdot\vec{\beta})^2\]

here the i-th observation has multiple features:

\[x_i = \left(x_{i1},x_{i2},...x_{ip}\right)\]

where the “dot” product is defined as

\[x_i\cdot\vec{\beta} = \sum_{j=1}^{p} x_{ij}\beta_j\]

RMSE#

Root mean squared error:

\[\text{RMSE}:=\left(\frac{1}{n}\sum_{i=1}^{n}(y_i-x_i\cdot\vec{\beta})^2\right)^{1/2}\]

MAE#

Mean absolute error:

\[\text{MAE}:=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-x_i\cdot\vec{\beta}\right|\]
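
A minimal sketch (hypothetical observations and predictions) of the three metrics, with MSE and MAE also checked against sklearn.metrics:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y = np.array([3.0, 5.0, 7.0, 9.0])          # hypothetical observed values
y_hat = np.array([2.8, 5.3, 6.9, 9.2])      # hypothetical predictions x_i . beta

mse = np.mean((y - y_hat)**2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y - y_hat))
print(mse, rmse, mae)
print(mean_squared_error(y, y_hat), mean_absolute_error(y, y_hat))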

Code Applications#


In the following example we learn how to write Python code for determining the line of best fit given one dependent variable and one input feature. That is to say, we are going to determine a slope \(m\) and an intercept \(b\), the equation of the best-fit line being \(y = mx + b\).

We are going to analyze a real data set that was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-1974 models).

Setup#

import os
if 'google.colab' in str(get_ipython()):
  print('Running on CoLab')
  from google.colab import drive
  drive.mount('/content/drive')
  os.chdir('/content/drive/My Drive/Data Sets')
  !pip install -q pygam
else:
  print('Running locally')
  os.chdir('../Data')
# Library Setup
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
from __future__ import print_function
from scipy.stats import ksone

Data Import#

# data
cars = pd.read_csv('mtcars.csv')
cars
model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Objective 1: We want to know if there is any association between the weight of the car and the mileage.

cars['mpg'].values
array([21. , 21. , 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26. , 30.4, 15.8, 19.7, 15. , 21.4])
y = cars.mpg.values
x = cars.wt.values
model = LinearRegression()
model.fit(x.reshape(-1,1),y)
LinearRegression()
model.coef_
array([-5.34447157])

Relationship between x and y#

y = cars['mpg'].values
x = cars['wt'].values
xb, yb = np.mean(x) , np.mean(y) # here we compute the center of mass
xb
3.21725
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(x,y,color='cyan',s=50,ec='black',alpha=0.7)
ax.set_xlim(0, 6)
ax.set_ylim(0, 50)
ax.set_xlabel('Weight of the car in 1000lbs',fontsize=14,color='navy')
ax.set_ylabel('Miles Per Gallon',fontsize=14,color='navy')
ax.grid(which='major', color ='grey', linestyle='-', alpha=0.8)
ax.grid(which='minor', color ='grey', linestyle='--', alpha=0.2)
plt.axvline(x=xb, color='red',linestyle='dashed')
plt.axhline(y=yb, color='red',linestyle='dashed')
plt.tick_params(axis='x', color='navy')
plt.tick_params(axis='y', color='navy')
ax.minorticks_on()
plt.show()
# here we make x a column vector or a matrix
x = x.reshape(-1,1) #technical
model = LinearRegression()
model.fit(x,y)
LinearRegression()
m = model.coef_
b = model.intercept_
print('The Slope is :'+str(m),' and the intercept is :' +str(b))
The Slope is :[-5.34447157]  and the intercept is :37.28512616734204
from scipy.stats import pearsonr
# we have to "ravel" back the x
x = x.ravel()
r, pval = pearsonr(x,y)
print('The correlation coefficient is :'+str(r)+'  and the pvalue for significance is :'+str(pval))
The correlation coefficient is :-0.8676593765172278  and the pvalue for significance is :1.293958701350513e-10
m = r*np.std(y)/np.std(x)
print('The slope of the trend is :' +str(m))
The slope of the trend is :-5.344471572722676

The equation of the line passing through the center of mass that captures the trend is \(y-\bar{y} = m\cdot(x-\bar{x}).\)

In short, \(y=m\cdot x + b\) where \(b = \bar{y} - m\cdot\bar{x}.\)

b = yb-m*xb
print('In our case the intercept is: ' +str(b)) # what is the meaning of the intercept?
In our case the intercept is: 37.285126167342035

OLS Equivalence#

The main idea is that in 1-D the OLS estimation is equivalent to the calculation of the slope via the Pearson correlation coefficient.

def loss(b, m, data): # this is also called "loss" function
    total_cost = 0

    # number of datapoints in training data
    N = float(len(data))

    # Compute sum of squared errors
    for i in range(0, len(data)):
        x = data[i, 0]
        y = data[i, 1]
        total_cost += (y - (m * x + b)) ** 2

    # Return average of squared error
    return total_cost/N
# this is the actual gradient descent
def step_gradient(b_current, m_current, data, alpha):
    """takes one step down towards the minima

    Args:
        b_current (float): current value of b
        m_current (float): current value of m
        data (np.array): array containing the training data (x,y)
        alpha (float): learning rate / step size

    Returns:
        tuple: (b,m) new values of b,m
    """

    m_gradient = 0
    b_gradient = 0
    N = float(len(data))

    # Calculate Gradient - assuming you know partial derivatives
    for i in range(0, len(data)):
        x = data[i, 0]
        y = data[i, 1]
        m_gradient += - (2/N) * x * (y - (m_current * x + b_current)) # is the partial derivative with respect to m
        b_gradient += - (2/N) * (y - (m_current * x + b_current)) # is the partial derivative with respect to b

    # Update current m and b, alpha stands for learning rate
    # we proceed in the direction of the negative gradient
    m_updated = m_current - alpha * m_gradient
    b_updated = b_current - alpha * b_gradient

    #Return updated parameters
    return b_updated, m_updated

def gradient_descent(data, starting_b, starting_m, learning_rate, num_iterations):
    """runs gradient descent

    Args:
        data (np.array): training data, containing x,y
        starting_b (float): initial value of b (random)
        starting_m (float): initial value of m (random)
        learning_rate (float): hyperparameter to adjust the step size during descent
        num_iterations (int): hyperparameter, decides the number of iterations for which gradient descent would run

    Returns:
        list : the first and second item are b, m respectively at which the best fit curve is obtained, the third and fourth items are two lists, which store the value of b,m as gradient descent proceeded.
    """

    # initial values
    b = starting_b
    m = starting_m

    # to store the cost after each iteration
    cost_graph = []

    # to store the value of b -> bias unit, m-> slope of line after each iteration (pred = m*x + b)
    b_progress = []
    m_progress = []

    # For every iteration, optimize b, m and compute its cost
    for i in range(num_iterations):
        cost_graph.append(loss(b, m, data))
        b, m = step_gradient(b, m, data, learning_rate)
        b_progress.append(b)
        m_progress.append(m)

    return [b, m, cost_graph,b_progress,m_progress]
# hyperparameters
learning_rate = 0.001
initial_b = 12
initial_m = 3
num_iterations = 50000
data = np.column_stack([x,y])
b, m, cost_graph,b_progress,m_progress = gradient_descent(data, initial_b, initial_m, learning_rate, num_iterations)

#Print optimized parameters
print ('Optimized b:', b)
print ('Optimized m:', m)

#Print error with optimized parameters
print ('Minimized cost:', loss(b, m, data))
Optimized b: 37.27234604666433
Optimized m: -5.340801130745784
Minimized cost: 8.697573986717137
fig, ax = plt.subplots(figsize=(10,6))
plt.plot(cost_graph)
ax.set_xlabel('No. of iterations',fontsize=14,color='blue')
ax.set_ylabel('Cost',fontsize=14,color='blue')
plt.title('Cost per iteration',color='red',fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
ax.tick_params(axis='x', colors='forestgreen')
ax.tick_params(axis='y', colors='forestgreen')
plt.show()
from sklearn.linear_model import LinearRegression
# what is the shape of x?
x.shape
(32,)
model = LinearRegression()
model.fit(x.reshape(-1,1),y)
LinearRegression()
model.coef_ , model.intercept_ # this is what the model has learned
(array([-5.34447157]), 37.28512616734204)
x_range = np.linspace(np.min(x)-1,np.max(x)+1,2)
yhat = model.predict(x_range.reshape(-1,1))
#Plot dataset
plt.scatter(x, y,ec='k',color='cyan',alpha=0.6)
#Predict y values
pred = model.coef_ * x_range + model.intercept_
#Plot predictions as line of best fit
plt.plot(x_range, pred, c='g')
plt.xlim(0, 6)
plt.ylim(0, 50)
plt.xlabel('Independent Variable $X$',fontsize=14,color='navy')
plt.ylabel('Dependent Variable $Y$',fontsize=14,color='navy')
plt.grid(which='major', color ='grey', linestyle='-', alpha=0.8)
plt.grid(which='minor', color ='grey', linestyle='--', alpha=0.2)
plt.axvline(x=xb, color='red',linestyle='dashed')
plt.axhline(y=yb, color='purple',linestyle='dashed')
plt.tick_params(axis='x', color='navy')
plt.tick_params(axis='y', color='navy')
plt.minorticks_on()
plt.title('Line of best fit',fontsize=14)
plt.legend(['Data','Trend','Mean of $X$','Mean of $Y$'])
plt.savefig('lineacorrelation.png',dpi=300)
plt.show()
model.score(x.reshape(-1,1),y)
0.7528327936582646
r**2
0.7528327936582647

Test the Residuals for Goodness of Fit#

We investigate the distribution of the residuals, plot a histogram, and apply a normality test.

residuals = y - model.predict(x.reshape(-1,1))
import seaborn as sns

# import scipy.stats and the normal distribution for the fitted overlay
from scipy import stats
from scipy.stats import norm
ax1 = sns.distplot(residuals,
                   bins=8,
                   kde=False,
                   color='deepskyblue',
                   hist_kws={"color":'green','ec':'black'},
                   fit=stats.norm,
                   fit_kws={"color":'red'})
ax1.set(xlabel='Residuals', ylabel='Frequency')
plt.show()

Message: The histogram does not quite look like a normal distribution. We can also consider a Q-Q Plot:#

import statsmodels.api as sm
sm.qqplot(residuals/np.std(residuals), loc = 0, scale = 1, line='s',alpha=0.5)
plt.xlim([-2.5,2.5])
plt.ylim([-2.5,2.5])
plt.gca().set_aspect('equal')  # use the current axes to set an equal aspect ratio
plt.grid(b=True,which='major', color ='grey', linestyle='-', alpha=0.5)
plt.grid(b=True,which='minor', color ='grey', linestyle='--', alpha=0.15)
plt.minorticks_on()
plt.show()
dist = getattr(stats, 'norm')
params = dist.fit(residuals)
params
(-8.215650382226158e-15, 2.949162685955028)
# here we apply the test
stats.kstest(residuals, "norm", params)
KstestResult(statistic=0.08217402470387336, pvalue=0.9821261392158506)

The conclusion, in our case, is that the normality assumption is not violated.#

# for your convenience, this is the table with the critical values for the
# Kolmogorov-Smirnov test

def ks_critical_value(n_trials, alpha):
    return ksone.ppf(1-alpha/2, n_trials)

trials = range(1, 40)
alphas = [0.1, 0.05, 0.02, 0.01]

# Print table headers
print('{:<6}|{:<6} Level of significance, alpha'.format(' ', ' '))
print('{:<6}|{:>8} {:>8} {:>8} {:>8}'.format(*['Trials'] + alphas))
print('-' * 42)
# Print critical values for each n_trials x alpha combination
for t in trials:
    print('{:6d}|{:>8.5f} {:>8.5f} {:>8.5f} {:>8.5f}'
          .format(*[t] + [ks_critical_value(t, a) for a in alphas]))
    if t % 10 == 0:
        print()
ks_critical_value(31,0.4)
0.15594527177147208

A different normality test indicates the same thing#

stats.anderson(residuals,dist='norm')
AndersonResult(statistic=0.46842144463906266, critical_values=array([0.523, 0.596, 0.715, 0.834, 0.992]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))
stats.shapiro(residuals)
(0.9450768828392029, 0.10438867658376694)
model = sm.OLS(y, x)
results = model.fit()
print(results.summary())
x = cars.iloc[:,2:8]
x
cyl disp hp drat wt qsec
0 6 160.0 110 3.90 2.620 16.46
1 6 160.0 110 3.90 2.875 17.02
2 4 108.0 93 3.85 2.320 18.61
3 6 258.0 110 3.08 3.215 19.44
4 8 360.0 175 3.15 3.440 17.02
5 6 225.0 105 2.76 3.460 20.22
6 8 360.0 245 3.21 3.570 15.84
7 4 146.7 62 3.69 3.190 20.00
8 4 140.8 95 3.92 3.150 22.90
9 6 167.6 123 3.92 3.440 18.30
10 6 167.6 123 3.92 3.440 18.90
11 8 275.8 180 3.07 4.070 17.40
12 8 275.8 180 3.07 3.730 17.60
13 8 275.8 180 3.07 3.780 18.00
14 8 472.0 205 2.93 5.250 17.98
15 8 460.0 215 3.00 5.424 17.82
16 8 440.0 230 3.23 5.345 17.42
17 4 78.7 66 4.08 2.200 19.47
18 4 75.7 52 4.93 1.615 18.52
19 4 71.1 65 4.22 1.835 19.90
20 4 120.1 97 3.70 2.465 20.01
21 8 318.0 150 2.76 3.520 16.87
22 8 304.0 150 3.15 3.435 17.30
23 8 350.0 245 3.73 3.840 15.41
24 8 400.0 175 3.08 3.845 17.05
25 4 79.0 66 4.08 1.935 18.90
26 4 120.3 91 4.43 2.140 16.70
27 4 95.1 113 3.77 1.513 16.90
28 8 351.0 264 4.22 3.170 14.50
29 6 145.0 175 3.62 2.770 15.50
30 8 301.0 335 3.54 3.570 14.60
31 4 121.0 109 4.11 2.780 18.60
model = LinearRegression()  # re-create the sklearn model for the multi-feature fit
model.fit(x,y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
model.score(x,y)
0.8548224115848234
residuals = y - model.predict(x)

sns.distplot(residuals,
                  bins=8,
                  kde=False,
                  color='deepskyblue',
                  hist_kws={"color":'green','ec':'black'},
                  fit=stats.norm,
                  fit_kws={"color":'red'})
plt.show()
dist = getattr(stats, 'norm')
params = dist.fit(residuals)
# here we apply the Kolmogorov-Smirnov test
stats.kstest(residuals, "norm", params) # normality is still not rejected, although the p-value is lower than in the single-feature case
KstestResult(statistic=0.15722261669146254, pvalue=0.37091468826166857)