Supervised Machine Learning: Linear Regression#

Introduction #

Main Idea:

  • the output (dependent) variable is continuous and we want to “predict” its value within the range of the input features. (WARNING: predicting outside that range, i.e. extrapolating, could lead to flawed inferences).

  • there is “noise”, which means that for essentially the same input values there may be slightly different values of the output variable, or there is “noise” in the measurement of all the variables.

  • we assume that the noise (i.e. the errors in measurement) follows a normal distribution with mean 0 and some unknown standard deviation.

Seminal Work: The linear correlation coefficient between two variables was introduced in Pearson, Karl (20 June 1895). “Notes on regression and inheritance in the case of two parents”. Proceedings of the Royal Society of London. 58: 240–242.

\[\large r: = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)\]

Here \(\bar{x}\) is the mean of \(x\), \(\bar{y}\) is the mean of \(y\), \(s_x\) is the standard deviation of \(x\), and \(s_y\) is the standard deviation of \(y.\)
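
As a quick illustration (a minimal sketch with made-up numbers, not part of the original notes), the formula above can be checked against scipy.stats.pearsonr:

import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical input values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical output values

n = len(x)
sx, sy = x.std(ddof=1), y.std(ddof=1)     # sample standard deviations
r_manual = np.sum((x - x.mean()) / sx * (y - y.mean()) / sy) / (n - 1)

r_scipy, pval = pearsonr(x, y)
print(r_manual, r_scipy)                  # the two values agree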

Visual Intuition:

The Linear Correlation Concept

Test statistic for the correlation coefficient:

\[\large t=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\]

where \(n\) is the number of observations and \(r\) represents the correlation coefficient computed from the data.
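
A minimal sketch (hypothetical values of \(r\) and \(n\)) of how this statistic and its two-sided p-value could be computed with scipy:

import numpy as np
from scipy import stats

r, n = 0.87, 32                            # hypothetical correlation and sample size
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
pval = 2 * stats.t.sf(abs(t), df=n - 2)    # two-sided p-value, n-2 degrees of freedom
print(t, pval)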

The slope of the regression line is

\[\large m = r\cdot\frac{s_y}{s_x}\]

Theoretical Perspective:

  • we want to estimate the expected value of the dependent variable as a function of the input features; that is, we want to approximate the conditional expectation \(\mathbb{E}(Y|\text{input features})\) by a function of the input features:

\[\large\mathbb{E}(Y|X=x) = f(x)\]
  • we want to determine the simplest form of the function \(f\) (principle of parsimony) and we assume that

\[\large Y = f(X) + \sigma \epsilon\]

where \(\epsilon\) is the “noise”: in statistical terms, \(\epsilon\) is independent and identically distributed and follows a standard normal distribution, and \(\sigma>0\) is generally unknown.
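
As a small illustration (a simulation sketch, not part of the original notes; the choice \(f(x)=1+3x\) and \(\sigma=2\) is purely hypothetical), this is what the assumed data-generating mechanism looks like in code:

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0                          # unknown in practice; fixed here only to simulate
x = rng.uniform(0, 10, size=100)     # input feature
eps = rng.standard_normal(100)       # iid standard normal noise
y = 1 + 3 * x + sigma * eps          # observed output: Y = f(X) + sigma*epsilon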

The Coefficient of Determination

\[\large R^2:=1-\frac{\sum (residual_i)^2}{\sum(y_i-\bar{y})^2}\]
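
A minimal sketch (made-up numbers) computing \(R^2\) from this definition and checking it against sklearn.metrics.r2_score:

import numpy as np
from sklearn.metrics import r2_score

y = np.array([3.0, 5.0, 7.0, 9.0])         # hypothetical observed values
y_hat = np.array([2.8, 5.3, 6.9, 9.2])     # hypothetical predictions

residuals = y - y_hat
r2_manual = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)
print(r2_manual, r2_score(y, y_hat))       # the two values agree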

Multiple Linear Regression (Linear models with more features)#

Important: The matrix-vector product is a linear combination of the columns of the matrix:

\[\begin{split}\large X\beta =\beta_1\begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix} + \beta_2\begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix} + \cdots + \beta_p\begin{bmatrix} x_{1p} \\ x_{2p} \\ \vdots \\ x_{np} \end{bmatrix} \end{split}\]

where

\[\begin{split}\large X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \end{split}\]
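
A minimal numpy sketch (made-up numbers) of this fact, checking that \(X\beta\) equals the linear combination of the columns of \(X\):

import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [1.0, 0.0, 2.0]])             # n = 4 observations, p = 3 features
beta = np.array([0.5, -1.0, 2.0])

as_product = X @ beta
as_columns = sum(beta[j] * X[:, j] for j in range(X.shape[1]))
print(np.allclose(as_product, as_columns))  # True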

Vector Spaces #

| Axiom | Meaning |
|---|---|
| Associativity of vector addition | u + (v + w) = (u + v) + w |
| Commutativity of vector addition | u + v = v + u |
| Identity element of vector addition | There exists an element 0 ∈ V, called the zero vector, such that v + 0 = v for all v ∈ V. |
| Inverse elements of vector addition | For every v ∈ V, there exists an element −v ∈ V, called the additive inverse of v, such that v + (−v) = 0. |
| Compatibility of scalar multiplication with field multiplication | a(bv) = (ab)v |
| Identity element of scalar multiplication | 1v = v, where 1 denotes the multiplicative identity in F. |
| Distributivity of scalar multiplication with respect to vector addition | a(u + v) = au + av |
| Distributivity of scalar multiplication with respect to field addition | (a + b)v = av + bv |

An example of a linear model with two features is \(\hat{y}_i = 1+3x_{i1}+5x_{i2}.\)

In this example the value \(1\) is referred to as the intercept.

If there are \(p\) features in the data and we want to create a linear model, the input-output mechanism is

\[ \underbrace{Y}_\text{Output} = \underbrace{\beta_1 X_1+\beta_2 X_2+...+\beta_p X_p}_\text{Linear combination of features} \]

This can be represented as a matrix-vector product:

\[ \underbrace{Y}_\text{Output} = \underbrace{X\cdot \beta}_\text{Linear combination of the columns of matrix X} \]

In this model the features are \(X_1, X_2, ...X_p\) and \(\beta_1, \beta_2,...\beta_p\) are a set of weights (real numbers).

The assumption for multiple linear regression is

\[\large Y = X\beta + \sigma \epsilon \]

where \(\sigma\) is the standard deviation of the noise. Further, we assume that the “noise” \(\epsilon\) is independent and identically distributed with a zero mean.

We believe that the output is a linear combination of the input features.

Thus, if we would like to solve for the “weights” \(\beta\) we may consider

\[\large X^tY = X^tX\beta+\sigma X^t\epsilon \]

And if the matrix \(X^tX\) is invertible then, since the noise has zero mean, we can solve for the expected value of \(\beta\):

\[\large \mathbb{E}(\beta) = (X^tX)^{-1}X^t Y \]

Using linear algebra, one can show that this is equivalent to the OLS solution obtained from minimizing the sum of the squared residuals.
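
A minimal sketch (simulated data, illustrative names) of this equivalence: solving the normal equations directly and comparing with sklearn's LinearRegression (with fit_intercept=False so both solve the same problem):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                     # hypothetical feature matrix
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # solve the normal equations
ols = LinearRegression(fit_intercept=False).fit(X, y)
print(beta_hat, ols.coef_)                       # the two estimates agree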

Linear vs Non-linear models#

This is a linear model in terms of the weights \(\beta\):

\[\large \hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 - \beta_3x_3 \]

An example of what linear in the weights means:

\[\large \hat{y}(2\beta+3\alpha) = 2\hat{y}(\beta)+3\hat{y}(\alpha) \]

The following is a non-linear model in terms of the coefficients (weights):

\[\large \hat{y} = \beta_0 + \beta_1^3x_1 + \frac{1}{\beta_2+\beta_3}x_2 - e^{\beta_3}x_3 \]
\[\large \hat{y}(2\beta+3\alpha) \neq 2\hat{y}(\beta)+3\hat{y}(\alpha) \]
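
A minimal numerical check (made-up feature values and weights) of the two statements above: the first prediction rule is linear in the weights, the second is not:

import numpy as np

x = np.array([1.0, 2.0, 3.0])                    # fixed feature values x1, x2, x3

def yhat_linear(b):                              # b = (b0, b1, b2, b3)
    return b[0] + b[1]*x[0] + b[2]*x[1] - b[3]*x[2]

def yhat_nonlinear(b):
    return b[0] + b[1]**3 * x[0] + x[1] / (b[2] + b[3]) - np.exp(b[3]) * x[2]

beta = np.array([1.0, 2.0, 3.0, 4.0])
alpha = np.array([0.5, -1.0, 2.0, 1.0])

print(np.isclose(yhat_linear(2*beta + 3*alpha),
                 2*yhat_linear(beta) + 3*yhat_linear(alpha)))        # True
print(np.isclose(yhat_nonlinear(2*beta + 3*alpha),
                 2*yhat_nonlinear(beta) + 3*yhat_nonlinear(alpha)))  # False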

The main point of linear regression is to assume that predictions can be made by using a linear combination of the features.

For example, the data from feature \(j\) is a column vector

\[\begin{split}\begin{bmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{bmatrix}\end{split}\]

and we assume that the dependent variable is predicted by a linear combination of these columns populated with the features’ data. Each column represents a feature and each row an independent observation.

The predicted value is denoted by \(\hat{y}\) and

\[\begin{split} \hat{y} = \beta_1\begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix} + \beta_2\begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix} + \cdots + \beta_p\begin{bmatrix} x_{1p} \\ x_{2p} \\ \vdots \\ x_{np} \end{bmatrix} \end{split}\]

We have a vector of weights:#

\[\begin{split} \beta = \begin{bmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{p} \end{bmatrix} \end{split}\]


Critical thinking: what exactly is \(\hat{y}\)?

The matrix-vector product between the features and the weights:

\[ \hat{y} = X\cdot\beta \]

The main idea is that

\[\begin{split} \hat{y}= \begin{bmatrix} \hat{y}_{1} \\ \hat{y}_{2} \\ \vdots \\ \hat{y}_{n} \end{bmatrix} \end{split}\]

represents the predictions we make by training (or, as we say in ML, learning) the weights \(\beta.\)

Training means running an optimization algorithm and determining the values of the weights that minimize an objective function.

We want to learn the weights \(\beta_1,\beta_2,...\beta_p\) that minimize the sum of the squared residuals:#

\[\large \sum\limits_{i=1}^{n}\left(y_i-\sum\limits_{j=1}^{p}X_{i,j}\cdot\beta_j\right)^2 = \sum\limits_{i=1}^{n}\left(y_i-X_{i,1}\beta_1-X_{i,2}\beta_2 - \cdots - X_{i,p}\beta_p\right)^2 \]

How do we know we are on the right track after we perform the minimization of the squared residuals? #

Ordinary Least Squares Regression (OLS)#

First, we assume the simplest case: the data has only one continuous input feature.

The main idea of linear regression is that the expected value of the output is a linear function of the input variable(s).

\[\mathbb{E}(Y|X=x)\approx m\cdot x + b\]

To determine the line of best fit the goal is to minimize the sum of squared residuals:

\[ \min_{m,b} \sum\limits_{i=1}^{n}(y_i-mx_i-b)^2 \]

So the sum of the squared residuals is

\[ \sum\limits_{i=1}^{n}(y_i-mx_i-b)^2 \]

If \(N\) represents the number of observations (the number of rows in the data) then the cost function may be defined as

\[ L(m,b) = \frac{1}{N} \sum\limits_{i=1}^{N}(y_i-mx_i-b)^2 \]

where

\[\hat{y_i} = m\cdot x_i +b.\]

Given the predictions \(\hat{y}_i\), the Mean Squared Error is

\[ \frac{1}{N} \sum\limits_{i=1}^{N}(y_i-\hat{y}_i)^2 \]

Critical Thinking: at the optimal values \((m,b)\) the partial derivatives of the cost function \(L\) are equal to 0.

The gradient descent algorithm is based on this idea.
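
For completeness, here is the standard calculation (not spelled out in the original notes). Setting both partial derivatives of \(L\) to zero,

\[\frac{\partial L}{\partial m} = -\frac{2}{N}\sum_{i=1}^{N} x_i\,(y_i - m x_i - b)=0, \qquad \frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N} (y_i - m x_i - b)=0,\]

and solving the resulting system gives the closed-form OLS estimates

\[m = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{N}(x_i-\bar{x})^2}, \qquad b = \bar{y} - m\,\bar{x}.\]

The same two partial derivatives reappear below in the gradient descent code (the step_gradient function).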

Thus, the equation of the best-fit line is \(y = mx + b.\)

CRITICAL THINKING: How exactly are we obtaining the slope and the intercept?

ANSWER: One way to obtain the slope and the intercept is by applying the Ordinary Least Squares method.

We determine the values of \(m\) and \(b\) such that the sum of the squared vertical distances between the points and the line is minimal.

Source: Simple Linear Regression (Tobias Roeschl)

Diagnostics for Regression#

The Coefficient of Determination#

\[\large R^2:=1-\frac{\sum (residual_i)^2}{\sum(y_i-\bar{y})^2}\]

We know we have done a good job when \(R^2\) is very close to 1. We have done a very poor job if \(R^2\) is close to 0 or even negative.

Tests for Normality#

We believe that, if the residuals are normally distributed, then the average of the errors is a meaningful estimator for the model’s performance.

What is a Normality Test?#

Assume we have a univariate set of values (i.e., a sample from a univariate distribution). We check, based on some calculations with the sample data, whether that univariate distribution is a normal one.

Main Idea: We measure how well the empirical distribution of the data agrees with the theoretical distribution of a standard normal.

We want to recast this matching procedure as a linear correlation problem; that is, we compare the two cumulative distribution functions and check whether the empirical percentiles correlate linearly with the theoretical percentiles of a standard normal distribution.
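
A minimal sketch (simulated data, illustrative only) of this idea using scipy.stats.probplot, which also returns the correlation between the ordered sample and the theoretical normal quantiles:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=100)     # hypothetical univariate data

(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(sample, dist="norm")
print(r)   # close to 1 when the sample is compatible with a normal distribution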

The Kolmogorov-Smirnov Test#

The Kolmogorov-Smirnov test uses the concept of cumulative distribution functions:

\[CDF(x):= P(X<x) = \int_{-\infty}^{x}f(t)dt\]

The concept is useful in applications where we want to check if a random variable follows a certain distribution.

IMPORTANT: In most cases we standardize the values of the random variable, e.g. we compute z-scores.

The test is defined as:

H0 (the null hypothesis): The data follow a specified distribution.

H1 (the alternative hypothesis): The data do not follow the specified distribution

The main idea is that we focus on how much the empirical cumulative distribution function differs from the theoretical cumulative distribution function, and we may consider:

\[\sup_{x} |ECDF(x) - CDF(x)|\]

where \(ECDF(x)\) means the empirical cumulative distribution function:

\[ECDF(x):= \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}(x_i < x)\]

and \(CDF\) stands for the cumulative distribution function:

\[CDF(x):= \int_{-\infty}^{x}f(t)dt.\]

Here \(f\) is the probability density function.

If we order the observations, such that \(x_i\leq x_j\) when \(i\leq j\), then the test statistic is formally defined by:

\[D:=\max_{1\leq i\leq n}\left\{CDF(x_i)-\frac{i-1}{n},\ \frac{i}{n}-CDF(x_i)\right\}\]

In the \(ECDF\) formula above, the indicator adds \(1\) for each observation less than \(x\), and \(n\) represents the sample size.

If the p-value is high (much greater than 5%) we do not reject the null hypothesis, which means that the normality assumption is not violated.
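
A minimal sketch (simulated, already standardized data) computing the statistic \(D\) directly from its definition and comparing with scipy.stats.kstest:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.standard_normal(50)           # hypothetical z-scores
x_sorted = np.sort(sample)
n = len(x_sorted)
i = np.arange(1, n + 1)
cdf = stats.norm.cdf(x_sorted)             # theoretical CDF at the ordered points

D_manual = np.max(np.maximum(cdf - (i - 1) / n, i / n - cdf))
result = stats.kstest(sample, "norm")
print(D_manual, result.statistic, result.pvalue)   # the two statistics agree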

The Anderson-Darling Test#

The test is defined as:

H0 (the null hypothesis): The data follow a specified distribution.

H1 (the alternative hypothesis): The data do not follow the specified distribution

The test statistic is defined as:

\[\large AD := -n - \sum_{i=1}^{n} \frac{2i-1}{n}\left[\ln(CDF(x_i))+\ln(1-CDF(x_{n+1-i}))\right] \]

The critical values for the Anderson-Darling test are dependent on the specific distribution that is being tested.
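
A minimal sketch (simulated data) of the statistic above, with the sample standardized by its mean and sample standard deviation before applying the formula, checked against scipy.stats.anderson (which, to our understanding, standardizes the same way for the normal case):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=40)               # hypothetical data
z = np.sort((sample - sample.mean()) / sample.std(ddof=1))   # ordered z-scores
n = len(z)
i = np.arange(1, n + 1)
cdf = stats.norm.cdf(z)

AD_manual = -n - np.sum((2 * i - 1) / n * (np.log(cdf) + np.log(1 - cdf[::-1])))
result = stats.anderson(sample, dist='norm')
print(AD_manual, result.statistic)         # the two values agree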

Error Metrics#

MSE#

\[\text{MSE}:=\frac{1}{n}\sum_{i=1}^{n}(y_i-x_i\cdot\vec{\beta})^2\]

here the i-th observation has multiple features:

\[x_i = \left(x_{i1},x_{i2},...x_{ip}\right)\]

where the “dot” product is defined as

\[x_i\cdot\vec{\beta} = \sum_{j=1}^{p} x_{ij}\beta_j\]

RMSE#

Root mean squared error:

\[\text{RMSE}:=\left(\frac{1}{n}\sum_{i=1}^{n}(y_i-x_i\cdot\vec{\beta})^2\right)^{1/2}\]

MAE#

Mean absolute error:

\[\text{MAE}:=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-x_i\cdot\vec{\beta}\right|\]
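
A minimal sketch (hypothetical observations and predictions) of the three metrics, with MSE and MAE also checked against sklearn.metrics:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y = np.array([3.0, 5.0, 7.0, 9.0])          # hypothetical observed values
y_hat = np.array([2.8, 5.3, 6.9, 9.2])      # hypothetical predictions x_i . beta

mse = np.mean((y - y_hat)**2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y - y_hat))
print(mse, rmse, mae)
print(mean_squared_error(y, y_hat), mean_absolute_error(y, y_hat))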

Code Applications#


In the following example we learn how to write Python code for determining the line of best fit given one dependent variable and one input feature. That is to say, we are going to determine a slope \(m\) and an intercept \(b\), the equation of the best-fit line being \(y = mx + b\).

We are going to analyze a real data set that was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-1974 models).

Setup#

import os
if 'google.colab' in str(get_ipython()):
  print('Running on CoLab')
  from google.colab import drive
  drive.mount('/content/drive')
  os.chdir('/content/drive/My Drive/Data Sets')
  !pip install -q pygam
else:
  print('Running locally')
  os.chdir('../Data')
# Library Setup
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
from __future__ import print_function
from scipy.stats import ksone

Data Import#

# data
cars = pd.read_csv('mtcars.csv')
cars
model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
5 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
6 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
7 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
9 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
11 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
17 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
21 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
25 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
28 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
29 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
30 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
31 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Objective 1: We want to know if there is any association between the weight of the car and the mileage.

cars['mpg'].values
array([21. , 21. , 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26. , 30.4, 15.8, 19.7, 15. , 21.4])
y = cars.mpg.values
x = cars.wt.values
model = LinearRegression()
model.fit(x.reshape(-1,1),y)
LinearRegression()
model.coef_
array([-5.34447157])

Relationship between x and y#

y = cars['mpg'].values
x = cars['wt'].values
xb, yb = np.mean(x) , np.mean(y) # here we compute the center of mass
xb
3.21725
fig, ax = plt.subplots(figsize=(6,4))
ax.scatter(x,y,color='cyan',s=50,ec='black',alpha=0.7)
ax.set_xlim(0, 6)
ax.set_ylim(0, 50)
ax.set_xlabel('Weight of the car in 1000lbs',fontsize=14,color='navy')
ax.set_ylabel('Miles Per Gallon',fontsize=14,color='navy')
ax.grid(which='major', color ='grey', linestyle='-', alpha=0.8)
ax.grid(which='minor', color ='grey', linestyle='--', alpha=0.2)
plt.axvline(x=xb, color='red',linestyle='dashed')
plt.axhline(y=yb, color='red',linestyle='dashed')
plt.tick_params(axis='x', color='navy')
plt.tick_params(axis='y', color='navy')
ax.minorticks_on()
plt.show()
# here we make x a column vector or a matrix
x = x.reshape(-1,1) #technical
model = LinearRegression()
model.fit(x,y)
LinearRegression()
m = model.coef_
b = model.intercept_
print('The Slope is :'+str(m),' and the intercept is :' +str(b))
The Slope is :[-5.34447157]  and the intercept is :37.28512616734204
from scipy.stats import pearsonr
# we have to "ravel" back the x
x = x.ravel()
r, pval = pearsonr(x,y)
print('The correlation coefficient is :'+str(r)+'  and the pvalue for significance is :'+str(pval))
The correlation coefficient is :-0.8676593765172278  and the pvalue for significance is :1.293958701350513e-10
m = r*np.std(y)/np.std(x)
print('The slope of the trend is :' +str(m))
The slope of the trend is :-5.344471572722676

The equation of the line passing through the center of mass that captures the trend is \(y-\bar{y} = m\cdot(x-\bar{x}).\)

In short, \(y=m\cdot x + b\) where \(b = \bar{y} - m\cdot\bar{x}.\)

b = yb-m*xb
print('In our case the intercept is: ' +str(b)) # what is the meaning of the intercept?
In our case the intercept is: 37.285126167342035

OLS Equivalence#

The main idea is that in 1-D the OLS estimation is equivalent to the calculation of the slope via the Pearson correlation coefficient.

def loss(b, m, data): # this is also called "loss" function
    total_cost = 0

    # number of datapoints in training data
    N = float(len(data))

    # Compute sum of squared errors
    for i in range(0, len(data)):
        x = data[i, 0]
        y = data[i, 1]
        total_cost += (y - (m * x + b)) ** 2

    # Return average of squared error
    return total_cost/N
# this is the actual gradient descent
def step_gradient(b_current, m_current, data, alpha):
    """takes one step down towards the minima

    Args:
        b_current (float): current value of b
        m_current (float): current value of m
        data (np.array): array containing the training data (x,y)
        alpha (float): learning rate / step size

    Returns:
        tuple: (b,m) new values of b,m
    """

    m_gradient = 0
    b_gradient = 0
    N = float(len(data))

    # Calculate Gradient - assuming you know partial derivatives
    for i in range(0, len(data)):
        x = data[i, 0]
        y = data[i, 1]
        m_gradient += - (2/N) * x * (y - (m_current * x + b_current)) # is the partial derivative with respect to m
        b_gradient += - (2/N) * (y - (m_current * x + b_current)) # is the partial derivative with respect to b

    # Update current m and b, alpha stands for learning rate
    # we proceed in the direction of the negative gradient
    m_updated = m_current - alpha * m_gradient
    b_updated = b_current - alpha * b_gradient

    #Return updated parameters
    return b_updated, m_updated

def gradient_descent(data, starting_b, starting_m, learning_rate, num_iterations):
    """runs gradient descent

    Args:
        data (np.array): training data, containing x,y
        starting_b (float): initial value of b (random)
        starting_m (float): initial value of m (random)
        learning_rate (float): hyperparameter to adjust the step size during descent
        num_iterations (int): hyperparameter, decides the number of iterations for which gradient descent would run

    Returns:
        list : the first and second item are b, m respectively at which the best fit curve is obtained, the third and fourth items are two lists, which store the value of b,m as gradient descent proceeded.
    """

    # initial values
    b = starting_b
    m = starting_m

    # to store the cost after each iteration
    cost_graph = []

    # to store the value of b -> bias unit, m-> slope of line after each iteration (pred = m*x + b)
    b_progress = []
    m_progress = []

    # For every iteration, optimize b, m and compute its cost
    for i in range(num_iterations):
        cost_graph.append(loss(b, m, data))
        b, m = step_gradient(b, m, data, learning_rate)
        b_progress.append(b)
        m_progress.append(m)

    return [b, m, cost_graph,b_progress,m_progress]
# hyperparameters
learning_rate = 0.001
initial_b = 12
initial_m = 3
num_iterations = 50000
data = np.column_stack([x,y])
b, m, cost_graph,b_progress,m_progress = gradient_descent(data, initial_b, initial_m, learning_rate, num_iterations)

#Print optimized parameters
print ('Optimized b:', b)
print ('Optimized m:', m)

#Print error with optimized parameters
print ('Minimized cost:', loss(b, m, data))
Optimized b: 37.27234604666433
Optimized m: -5.340801130745784
Minimized cost: 8.697573986717137
fig, ax = plt.subplots(figsize=(10,6))
plt.plot(cost_graph)
ax.set_xlabel('No. of iterations',fontsize=14,color='blue')
ax.set_ylabel('Cost',fontsize=14,color='blue')
plt.title('Cost per iteration',color='red',fontsize=14)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
ax.tick_params(axis='x', colors='forestgreen')
ax.tick_params(axis='y', colors='forestgreen')
plt.show()
from sklearn.linear_model import LinearRegression
# what is the shape of x?
x.shape
(32,)
model = LinearRegression()
model.fit(x.reshape(-1,1),y)
LinearRegression()
model.coef_ , model.intercept_ # this is what the model has learned
(array([-5.34447157]), 37.28512616734204)
x_range = np.linspace(np.min(x)-1,np.max(x)+1,2)
yhat = model.predict(x_range.reshape(-1,1))
#Plot dataset
plt.scatter(x, y,ec='k',color='cyan',alpha=0.6)
#Predict y values
pred = model.coef_ * x_range + model.intercept_
#Plot predictions as line of best fit
plt.plot(x_range, pred, c='g')
plt.xlim(0, 6)
plt.ylim(0, 50)
plt.xlabel('Independent Variable $X$',fontsize=14,color='navy')
plt.ylabel('Dependent Variable $Y$',fontsize=14,color='navy')
plt.grid(which='major', color ='grey', linestyle='-', alpha=0.8)
plt.grid(which='minor', color ='grey', linestyle='--', alpha=0.2)
plt.axvline(x=xb, color='red',linestyle='dashed')
plt.axhline(y=yb, color='purple',linestyle='dashed')
plt.tick_params(axis='x', color='navy')
plt.tick_params(axis='y', color='navy')
plt.minorticks_on()
plt.title('Line of best fit',fontsize=14)
plt.legend(['Data','Trend','Mean of $X$','Mean of $Y$'])
plt.savefig('lineacorrelation.png',dpi=300)
plt.show()
model.score(x.reshape(-1,1),y)
0.7528327936582646
r**2
0.7528327936582647

Test the Residuals for Goodness of Fit#

We investigate the distribution of the residuals, plot a histogram, and apply a normality test.

residuals = y - model.predict(x.reshape(-1,1))
import seaborn as sns

# import scipy.stats and the normal distribution for the fitted overlay
from scipy import stats
from scipy.stats import norm
ax1 = sns.distplot(residuals,
                   bins=8,
                   kde=False,
                   color='deepskyblue',
                   hist_kws={"color":'green','ec':'black'},
                   fit=stats.norm,
                   fit_kws={"color":'red'})
ax1.set(xlabel='Residuals', ylabel='Frequency')
plt.show()

Message: The histogram does not quite look like a normal distribution. We can also consider a Q-Q Plot:#

import statsmodels.api as sm
sm.qqplot(residuals/np.std(residuals), loc = 0, scale = 1, line='s',alpha=0.5)
plt.xlim([-2.5,2.5])
plt.ylim([-2.5,2.5])
plt.gca().set_aspect('equal')  # use the current axes to set an equal aspect ratio
plt.grid(b=True,which='major', color ='grey', linestyle='-', alpha=0.5)
plt.grid(b=True,which='minor', color ='grey', linestyle='--', alpha=0.15)
plt.minorticks_on()
plt.show()
dist = getattr(stats, 'norm')
params = dist.fit(residuals)
params
(-8.215650382226158e-15, 2.949162685955028)
# here we apply the test
stats.kstest(residuals, "norm", params)
KstestResult(statistic=0.08217402470387336, pvalue=0.9821261392158506)

The conclusion, in our case, is that the normality assumption is not violated.#

# for your convenience, this is the table with the critical values for the
# Kolmogorov-Smirnov test

def ks_critical_value(n_trials, alpha):
    return ksone.ppf(1-alpha/2, n_trials)

trials = range(1, 40)
alphas = [0.1, 0.05, 0.02, 0.01]

# Print table headers
print('{:<6}|{:<6} Level of significance, alpha'.format(' ', ' '))
print('{:<6}|{:>8} {:>8} {:>8} {:>8}'.format(*['Trials'] + alphas))
print('-' * 42)
# Print critical values for each n_trials x alpha combination
for t in trials:
    print('{:6d}|{:>8.5f} {:>8.5f} {:>8.5f} {:>8.5f}'
          .format(*[t] + [ks_critical_value(t, a) for a in alphas]))
    if t % 10 == 0:
        print()
ks_critical_value(31,0.4)
0.15594527177147208

A different normality test indicates the same thing#

stats.anderson(residuals,dist='norm')
AndersonResult(statistic=0.46842144463906266, critical_values=array([0.523, 0.596, 0.715, 0.834, 0.992]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))
stats.shapiro(residuals)
(0.9450768828392029, 0.10438867658376694)
model = sm.OLS(y, x)
results = model.fit()
print(results.summary())
x = cars.iloc[:,2:8]
x
cyl disp hp drat wt qsec
0 6 160.0 110 3.90 2.620 16.46
1 6 160.0 110 3.90 2.875 17.02
2 4 108.0 93 3.85 2.320 18.61
3 6 258.0 110 3.08 3.215 19.44
4 8 360.0 175 3.15 3.440 17.02
5 6 225.0 105 2.76 3.460 20.22
6 8 360.0 245 3.21 3.570 15.84
7 4 146.7 62 3.69 3.190 20.00
8 4 140.8 95 3.92 3.150 22.90
9 6 167.6 123 3.92 3.440 18.30
10 6 167.6 123 3.92 3.440 18.90
11 8 275.8 180 3.07 4.070 17.40
12 8 275.8 180 3.07 3.730 17.60
13 8 275.8 180 3.07 3.780 18.00
14 8 472.0 205 2.93 5.250 17.98
15 8 460.0 215 3.00 5.424 17.82
16 8 440.0 230 3.23 5.345 17.42
17 4 78.7 66 4.08 2.200 19.47
18 4 75.7 52 4.93 1.615 18.52
19 4 71.1 65 4.22 1.835 19.90
20 4 120.1 97 3.70 2.465 20.01
21 8 318.0 150 2.76 3.520 16.87
22 8 304.0 150 3.15 3.435 17.30
23 8 350.0 245 3.73 3.840 15.41
24 8 400.0 175 3.08 3.845 17.05
25 4 79.0 66 4.08 1.935 18.90
26 4 120.3 91 4.43 2.140 16.70
27 4 95.1 113 3.77 1.513 16.90
28 8 351.0 264 4.22 3.170 14.50
29 6 145.0 175 3.62 2.770 15.50
30 8 301.0 335 3.54 3.570 14.60
31 4 121.0 109 4.11 2.780 18.60
model = LinearRegression()  # re-create the sklearn model for the multi-feature fit
model.fit(x,y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
model.score(x,y)
0.8548224115848234
residuals = y - model.predict(x)

sns.distplot(residuals,
                  bins=8,
                  kde=False,
                  color='deepskyblue',
                  hist_kws={"color":'green','ec':'black'},
                  fit=stats.norm,
                  fit_kws={"color":'red'})
plt.show()
dist = getattr(stats, 'norm')
params = dist.fit(residuals)
# here we apply the Kolmogorov-Smirnov test
stats.kstest(residuals, "norm", params) # normality is still not rejected, although the p-value is lower than in the single-feature case
KstestResult(statistic=0.15722261669146254, pvalue=0.37091468826166857)