An Introduction to the Multilayer Perceptron #

Introduction #

Research in this area started in the late 1950s with F. Rosenblatt's “Mark I Perceptron”; the goal was to automatically detect capital letters.


Image source: https://jennysmoore.wordpress.com/2014/03/31/march-31-network-society-readings/

The Artificial Neuron #


Example of a neural network with five neurons:


It is a nature-inspired design. Check out the playground.
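As a small illustration (not part of the original notes), an artificial neuron computes a weighted sum of its inputs plus a bias and passes the result through an activation function. The sketch below uses made-up weights and a logistic activation:

```python
import numpy as np

def neuron(x, w, b):
    # weighted sum plus bias, followed by a logistic activation
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # inputs (arbitrary values)
w = np.array([0.8, 0.2, -0.5])   # weights (arbitrary values)
b = 0.1                          # bias
print(neuron(x, w, b))
```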

Fact: A frog has enough neurons to learn how to drive your car, but if it did, it would occupy its entire memory and it would not know how to feed itself.


Brief History of Development #

Neural networks have been a success for computer vision, image analysis, and classification problems; however, the method can be used for regression as well.

  • The Mark I Perceptron was able to identify capital letters.

  • From 1960 to 1986, progress was relatively slow.

  • In 1986, D. Rumelhart, G. Hinton and R. Williams demonstrated the first back-propagation network (seminal paper “Learning Representations by Back-Propagating Errors”, Nature, Vol. 323, 1986).

  • In 1990, P. Werbos published “Backpropagation Through Time: What It Does and How to Do It”.

  • In 2006, G. Hinton and R. Salakhutdinov examined the capability of neural nets for image recognition. Research into neural nets has grown explosively from 2006 to today.

Activation Functions #

Examples of activation functions:

| Name | Equation | Derivative |
|------|----------|------------|
| Identity | \(f(x) = x\) | \(f'(x) = 1\) |
| Binary step | \(f(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}\) | \(f'(x) = \begin{cases} 0 & \text{for } x \ne 0 \\ \text{undefined} & \text{for } x = 0 \end{cases}\) |
| Logistic (a.k.a. soft step) | \(f(x) = \frac{1}{1 + e^{-x}}\) | \(f'(x) = f(x)(1 - f(x))\) |
| TanH | \(f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1\) | \(f'(x) = 1 - f(x)^2\) |
| ArcTan | \(f(x) = \tan^{-1}(x)\) | \(f'(x) = \frac{1}{x^2 + 1}\) |
| Rectified Linear Unit (ReLU) | \(f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}\) | \(f'(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}\) |
| Parametric Rectified Linear Unit (PReLU) | \(f(x) = \begin{cases} \alpha x & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}\) | \(f'(x) = \begin{cases} \alpha & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}\) |
| Exponential Linear Unit (ELU) | \(f(x) = \begin{cases} \alpha (e^x - 1) & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}\) | \(f'(x) = \begin{cases} f(x) + \alpha & \text{for } x < 0 \\ 1 & \text{for } x \ge 0 \end{cases}\) |
| SoftPlus | \(f(x) = \log(1 + e^x)\) | \(f'(x) = \frac{1}{1 + e^{-x}}\) |
| Swish | \(f(x) = \frac{x}{1 + e^{-\beta x}}\) | \(f'(x) = \beta f(x) + \sigma(\beta x)\,(1 - \beta f(x))\), where \(\sigma\) is the logistic sigmoid |
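As a quick illustration (the sample inputs and default parameter values below are arbitrary), a few of these activations can be evaluated directly with NumPy:

```python
import numpy as np

def logistic(x):
    # f(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # 0 for x < 0, x for x >= 0
    return np.maximum(0.0, x)

def elu(x, alpha=1.0):
    # alpha * (e^x - 1) for x < 0, x for x >= 0
    return np.where(x < 0, alpha * (np.exp(x) - 1.0), x)

def swish(x, beta=1.0):
    # x * sigma(beta * x)
    return x * logistic(beta * x)

x = np.linspace(-3.0, 3.0, 7)    # arbitrary sample inputs
for name, f in [('logistic', logistic), ('ReLU', relu), ('ELU', elu), ('Swish', swish)]:
    print(name, f(x))
```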

Backpropagation #

Backpropagation is the process of computing the gradient of the loss function with respect to the weights, by applying the chain rule backward through the network, and then updating the weights in the direction opposite to that gradient.
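In practice, frameworks compute these gradients automatically. As a minimal sketch (toy numbers, a single neuron with a squared-error loss), PyTorch's autograd performs the backward pass when we call `loss.backward()`:

```python
import torch

# A single neuron y = w*x + b with a squared-error loss (toy example)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)
x, target = torch.tensor(1.5), torch.tensor(4.0)

loss = (w * x + b - target) ** 2
loss.backward()            # backpropagation: compute d(loss)/dw and d(loss)/db

print(w.grad)              # equals 2*(w*x + b - target)*x
print(b.grad)              # equals 2*(w*x + b - target)
```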


Gradient Descent Methods for Optimization #

The optimization algorithms commonly used to train neural networks are all variants of gradient descent.

Let’s denote the vector of weights at step \(t\) by \(w_t\) and the gradient of the objective function with respect to the weights, evaluated at \(w_{t-1}\), by \(g_{t,t-1}\). The gradient descent algorithm updates the weights according to the following principle:

\[\large w_t = w_{t-1} - \eta\cdot g_{t,t-1}\]

When the objective function has multiple local minima, or a very shallow region containing the minimum, the plain gradient descent algorithm may not converge to the position sought. To remedy this deficiency, researchers have proposed alternatives that vary either the learning rate used at each step or the “velocity” used to update the weights:

\[\large w_t = w_{t-1} - \eta_t\cdot v_t\]

In the equation above, \(\eta_t\) is an adaptive learning rate and \(v_t\) a modified gradient.
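As a toy sketch of the plain update rule \(w_t = w_{t-1} - \eta\cdot g_t\), consider an arbitrary one-dimensional objective \(f(w) = (w-3)^2\) (my own illustrative choice):

```python
import numpy as np

def grad(w):
    # gradient of the toy objective f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

eta = 0.1                   # fixed learning rate
w = 0.0                     # initial weight
for t in range(50):
    w = w - eta * grad(w)   # w_t = w_{t-1} - eta * g_t
print(w)                    # close to the minimizer w* = 3
```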

Dynamic Learning Rates #

We can consider an exponential decay, such as

\[\large \eta_t = \eta_0 e^{-\lambda\cdot t}\]

or a polynomial decay

\[\large \eta_t = \eta_0 (\beta t+1)^{-\alpha}\]
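Such schedules are available in PyTorch; for example, `torch.optim.lr_scheduler.ExponentialLR` multiplies the learning rate by a fixed factor `gamma` at each call to `step()`, which corresponds to the exponential decay above with \(\gamma = e^{-\lambda}\). A minimal sketch using a throwaway parameter:

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))            # throwaway parameter
optimizer = torch.optim.SGD([param], lr=0.1)          # eta_0 = 0.1
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(5):
    optimizer.step()        # a real loop would compute a loss and call backward() first
    scheduler.step()        # eta_t = eta_0 * gamma^t
    print(epoch, optimizer.param_groups[0]['lr'])
```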

Momentum Gradient Descent #

The minibatch gradient at step \(t\), computed over the minibatch \(\text{B}_t\), is

\[\large g_{t,t-1} = \partial_w \frac{1}{|\text{B}_t|}\sum_{i\in \text{B}_t}f(x_i,w_{t-1})=\frac{1}{|\text{B}_t|}\sum_{i\in \text{B}_t}h_{i,t-1}\]

and the velocity is updated as

\[\large v_t = \beta v_{t-1} + g_{t,t-1}\]

with \(\beta\in (0,1).\)

For an explicit formula, we have

\[\large v_t = \sum_{\tau=0}^{t-1} \beta^{\tau}g_{t-\tau,t-\tau-1}\]

and

\[\large w_t = w_{t-1} - \alpha v_t\]

where

\[\large \alpha = \frac{\eta}{1-\beta}\]
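A minimal sketch of the momentum update on the same toy objective \(f(w) = (w-3)^2\) used above, with the step size \(\alpha = \eta/(1-\beta)\) from the notes (the values of \(\eta\) and \(\beta\) are arbitrary):

```python
import numpy as np

def grad(w):
    # gradient of the toy objective f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

eta, beta = 0.02, 0.9
alpha = eta / (1 - beta)       # step size alpha = eta / (1 - beta)
w, v = 0.0, 0.0
for t in range(200):
    v = beta * v + grad(w)     # v_t = beta * v_{t-1} + g_{t,t-1}
    w = w - alpha * v          # w_t = w_{t-1} - alpha * v_t
print(w)                       # close to the minimizer w* = 3
```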

AdaGrad (Adaptive Gradient Descent) #

AdaGrad accumulates the squared gradients,

\[\large s_t = s_{t-1} + g_{t}^2\]

and uses this running sum to scale the learning rate in the update

\[\large w_t= w_{t-1} - \frac{\eta}{\sqrt{s_t+\epsilon}}\cdot g_t\]
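The same toy objective \(f(w) = (w-3)^2\) gives a quick sketch of the AdaGrad update (the hyperparameter values are arbitrary):

```python
import numpy as np

def grad(w):
    # gradient of the toy objective f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

eta, eps = 0.5, 1e-8
w, s = 0.0, 0.0
for t in range(500):
    g = grad(w)
    s = s + g**2                           # s_t = s_{t-1} + g_t^2
    w = w - eta / np.sqrt(s + eps) * g     # scaled gradient step
print(w)                                   # approaches the minimizer w* = 3
```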

RMSProp (Root Mean Square Propagation) #

RMSProp replaces AdaGrad's running sum with an exponentially weighted moving average of the squared gradients,

\[\large s_t = \gamma\cdot s_{t-1} + (1-\gamma)\cdot g_{t}^2\]

and

\[\large w_t= w_{t-1} - \frac{\eta}{\sqrt{s_t+\epsilon}}\cdot g_t\]

Expanding the recursion, we have

\[\large s_t = (1-\gamma)\cdot \left(g_t^2+\gamma g_{t-1}^2+\gamma^2 g_{t-2}^2 +\gamma^3 g_{t-3}^2 + \cdots + \gamma^{t} g_0^2\right)\]
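A sketch of the RMSProp update on the same toy objective \(f(w) = (w-3)^2\) (arbitrary hyperparameter values):

```python
import numpy as np

def grad(w):
    # gradient of the toy objective f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

eta, gamma, eps = 0.05, 0.9, 1e-8
w, s = 0.0, 0.0
for t in range(300):
    g = grad(w)
    s = gamma * s + (1 - gamma) * g**2     # moving average of squared gradients
    w = w - eta / np.sqrt(s + eps) * g
print(w)                                   # ends up near the minimizer w* = 3
```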

ADAM (Adaptive Moment Estimation) #

ADAM maintains exponentially weighted averages of both the gradients and their squares:

\[\large v_t = \beta_1 v_{t-1} +(1-\beta_1) g_t\]

and

\[\large s_t = \beta_2 s_{t-1} + (1-\beta_2) g_t^2 \]

We further consider the bias-corrected estimates

\[\large \hat{v}_t = \frac{v_t}{1-\beta_1^t}, \text{ and } \hat{s}_t = \frac{s_t}{1-\beta_2^t} \]

and

\[\large \hat{g}_{t} = \frac{\eta\cdot \hat{v}_t}{\sqrt{\hat{s}_t}+\epsilon}\]

The updates to the weights are implemented as follows

\[\large w_t = w_{t-1} - \hat{g}_t\]
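Finally, a sketch of the ADAM update on the same toy objective \(f(w) = (w-3)^2\); the values \(\beta_1 = 0.9\) and \(\beta_2 = 0.999\) are the commonly used defaults, and the rest are arbitrary:

```python
import numpy as np

def grad(w):
    # gradient of the toy objective f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
w, v, s = 0.0, 0.0, 0.0
for t in range(1, 501):
    g = grad(w)
    v = beta1 * v + (1 - beta1) * g           # first-moment estimate
    s = beta2 * s + (1 - beta2) * g**2        # second-moment estimate
    v_hat = v / (1 - beta1**t)                # bias corrections
    s_hat = s / (1 - beta2**t)
    w = w - eta * v_hat / (np.sqrt(s_hat) + eps)
print(w)                                      # ends up near the minimizer w* = 3
```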

Multilayer Perceptron Instructional Videos #

The first 2 are required. The last 2 are optional but highly recommended, even if you have not had any calculus or linear algebra!

  1. But what is a Neural Network?

  2. Gradient descent, how neural networks learn

  3. What is backpropagation really doing?

  4. Backpropagation calculus

Test Your Understanding #

  1. What is a multilayer perceptron?

  2. What is a hidden layer?

  3. The network in the video was on the small-ish side, having only 2 hidden layers with 16 neurons each. How many total parameters (i.e. weights and biases) have to be determined during the training process for this network?

  4. Without reference to the calculus involved, do you understand the concept of gradient descent?

Code Applications #


References #

  1. Programming PyTorch for Deep Learning

  2. Getting Things Done with PyTorch

Setup#

# Set the working directory: mount Google Drive when running on Colab,
# otherwise use the local Data folder
import os
if 'google.colab' in str(get_ipython()):
  print('Running on CoLab')
  from google.colab import drive
  drive.mount('/content/drive')
  os.chdir('/content/drive/My Drive/Data Sets')
else:
  print('Running locally')
  os.chdir('../Data')
Running on CoLab
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import torch
import torch.nn as nn
import torch.optim as optim

import numpy as np
import matplotlib.pyplot as plt


from torch.utils.data import DataLoader, TensorDataset
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

Example of a Neural Net in PyTorch for Classification#

# Load and preprocess the data
wine_data = load_wine()
X = wine_data.data
y = wine_data.target

# Standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Convert the data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float64)
X_test_tensor = torch.tensor(X_test, dtype=torch.float64)
y_train_tensor = torch.tensor(y_train, dtype=torch.long)
y_test_tensor = torch.tensor(y_test, dtype=torch.long)

# Create DataLoader for training and testing
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Define a simple neural network
class WineNet(nn.Module):
    def __init__(self,n_features):
        super(WineNet, self).__init__()
        self.fc1 = nn.Linear(n_features, 50).double()
        self.fc2 = nn.Linear(50, 30).double()
        self.fc3 = nn.Linear(30, 3).double()

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize the model, loss function, and optimizer
model = WineNet(X_train.shape[1])
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        outputs = model(X_batch)
        loss = criterion(outputs, y_batch)
        loss.backward()
        optimizer.step()

    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluate the model on the test set
model.eval()
with torch.no_grad():
    y_pred_list = []
    y_true_list = []
    for X_batch, y_batch in test_loader:
        outputs = model(X_batch)
        _, y_pred = torch.max(outputs, 1)
        y_pred_list.append(y_pred)
        y_true_list.append(y_batch)

    y_pred = torch.cat(y_pred_list)
    y_true = torch.cat(y_true_list)
    accuracy = accuracy_score(y_true.numpy(), y_pred.numpy())
    print(f'Accuracy on test set: {accuracy:.4f}')
Epoch [10/100], Loss: 0.5738
Epoch [20/100], Loss: 0.1444
Epoch [30/100], Loss: 0.0459
Epoch [40/100], Loss: 0.0263
Epoch [50/100], Loss: 0.0081
Epoch [60/100], Loss: 0.0122
Epoch [70/100], Loss: 0.0090
Epoch [80/100], Loss: 0.0048
Epoch [90/100], Loss: 0.0050
Epoch [100/100], Loss: 0.0043
Accuracy on test set: 0.9815

Example of Classification w/ Class Imbalance#

Reference: https://github.com/curiousily/Getting-Things-Done-with-Pytorch/blob/master/04.first-neural-network.ipynb