Artificial Neural Networks (ANN)

DM 352 Syllabus | DM 552 Syllabus

last updated 23-Jun-2021

Introduction

Artificial Neural Networks are found primarily in the field of Artificial Intelligence (AI):

"The theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages."

What is considered AI has changed over time.

AI -- since 2011

Current state of AI

Future of AI

NLP

Traditional approach (used in the 1990s-2000s)

Today's NLP approaches use neural networks.


Biological mimicry

ANNs mimic the brain and its interconnection of neurons. Neurons connect to one another through axons.

A neuron can be thought of as receiving multiple connections from the axons of other neurons through its dendrites; each connection has a gap, or synapse, that controls the strength of the signal passed along.

Based on the input signals it receives, the neuron can fire a signal down its own output axon, which other neurons in turn use as input. The axons thus act as the inputs and outputs of the network.

The human brain has roughly 100 billion neurons. An individual neuron does not do much on its own, but the sheer number of neurons and their interconnections allow for complex thinking.


NN in Data Mining

ANNs serve as another form of supervised classification, one that can be applied to a wider variety of input data.

 

In an ANN the neuron is simulated by a perceptron, or node, which takes the inputs and calculates an output based on the weights of those inputs.

 

Example of a task: Output Y is 1 if at least two of the three inputs are equal to 1.

We want the network to learn this function. Of course, where we really use this is to learn to classify complicated structures, images, data sets, etc.
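For concreteness, the target function can be enumerated directly in a few lines of Python; this is the same labeling used for the training data in the perceptron code further below (the snippet itself is just an illustration):

from itertools import product

# enumerate all 8 input combinations and label them: Y = 1 when at least two inputs are 1
for x1, x2, x3 in product([0, 1], repeat=3):
    y = 1 if x1 + x2 + x3 >= 2 else -1
    print(x1, x2, x3, y)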

We add a weight to each of the inputs, sum the weighted inputs, and compare the result to a threshold t to determine what the node outputs.

The ANN model is an assembly, or network, of interconnected nodes and weighted links.

The output node sums its input values, weighted by the weights of its links, and then compares the sum against a threshold t to determine one of two output values: the neuron either "fires" or it does not.

These two formulations are the same: the second one defines an extra weight w0 = -t and input X0 = 1 so the threshold folds into the summation.
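A reconstruction of the two equivalent forms (the particular weights 0.3 and threshold t = 0.4 are the ones used in the perceptron example below, so they are assumed here):

\[
Y = \operatorname{sign}\!\left(\sum_{i=1}^{3} w_i X_i - t\right) = \operatorname{sign}(0.3X_1 + 0.3X_2 + 0.3X_3 - 0.4)
\]
\[
Y = \operatorname{sign}\!\left(\sum_{i=0}^{3} w_i X_i\right), \qquad w_0 = -t,\; X_0 = 1
\]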


General Structure of an ANN

 


Various types of neural network topologies

single-layered network (perceptron)

multi-layered network -- there can be many layers at the hidden stage

feed-forward -- output of each layer is connected as inputs to the next layer

recurrent network -- interconnections may cycle back

 

Each output is binary. If you want more than two classifications, you will need more output nodes to encode those possibilities (for example, one output node per class).



Various types of activation functions (f)

Sum the weighted inputs of the node and apply the activation function to produce the output.
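As a sketch of what a few common activation functions look like in code (the sign, sigmoid, tanh, and linear functions shown here are typical choices; the exact set in the original figure is assumed):

import math

def sign_act(v):      # step/sign activation, as used by the perceptron below
    return 1 if v >= 0 else -1

def sigmoid(v):       # smooth and differentiable; output in (0, 1)
    return 1.0 / (1.0 + math.exp(-v))

def tanh_act(v):      # like the sigmoid but output in (-1, 1)
    return math.tanh(v)

def linear(v):        # identity; no squashing
    return v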


Perceptron

Single layer network: Contains only input and output nodes

Activation function: f = sign(wᵀX), i.e., the sign of the dot product of the weights and inputs, as above

Applying the model is straightforward:

With inputs X1 = 1, X2 = 0, X3 = 1, the output is computed as

y = sign(0.3 + 0.0 + 0.3 - 0.4) = sign(0.2) = 1
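A quick check of that computation (the weights 0.3, 0.3, 0.3 and the bias w0 = -0.4 are the values from the example above):

w = [-0.4, 0.3, 0.3, 0.3]                   # w0 = -t, then w1, w2, w3
x = [1, 1, 0, 1]                            # X0 = 1, then X1 = 1, X2 = 0, X3 = 1
total = sum(wi * xi for wi, xi in zip(w, x))
print(total, 1 if total >= 0 else -1)       # about 0.2 (floating point), and 1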

 


Perceptron Learning Algorithm

Initialize the weights (w0, w1, …, wd) to 0 or to small random values.

Repeat: for each training instance (xi, yi), compute the prediction ŷi = sign(Σj wj xij) and update each weight wj = wj + λ(yi - ŷi)xij

Until a stopping condition is met (for example, no weight changes during a full pass, or a maximum number of epochs)

λ is the learning rate 0 ≤ λ ≤ 1 and controls the rate of change to the weights

Intuition:

Update the weights based on the error e = yi - ŷi. If the prediction is correct (e = 0), the weights are left unchanged; otherwise each weight wj is moved by λ e xij, i.e., in the direction that reduces the error, scaled by the learning rate and by the corresponding input value.

The left table shows the 8 training instances, with X0 = 1 added to each.

The middle table shows, instance by instance, how the weights are updated during the first pass.

The right Epoch table shows the weights after each epoch, i.e., after applying the corrections from running all the instances through the learning cycle again.

Example calculation details:

Instance 1: f((0,0,0,0), (1,1,0,0)) = 0, so the prediction is sign(0) = 1. Since Y = -1 and the prediction is 1, e = -1 - 1 = -2 and λe = -0.2
w0 = 0 + -0.2*1 = -0.2
w1 = 0 + -0.2*1 = -0.2
w2 = 0 + -0.2*0 = 0
w3 = 0 + -0.2*0 = 0

Instance 2: f ( (-0.2,-0.2,0,0), (1,1,0,1)) = -0.2 + -0.2 + 0 + 0 = -0.4, since Y= 1 > f(w,x)=-1, then e = 1-(-1)=2 and λe = 0.2
w0 = -0.2 + 0.2*1 = 0
w1 = -0.2 + 0.2*1 = 0
w2 = 0 + 0.2*0 = 0
w3 = 0 + 0.2*1 = 0.2

Instance 3: f ( (0,0,0,0.2), (1,1,1,0)) = 0 + 0 + 0 + 0 = 0, since Y= 1 = f(w,x)=1, then e = 0 and λe = 0
w0 = 0 + 0.0*1 = 0.0
w1 = 0 + 0.0*1 = 0.0
w2 = 0 + 0.0*1 = 0.0
w3 = 0.2 + 0.0*0 = 0.2

And so forth. Below is a Python implementation of the perceptron learning algorithm.


Python code for testing perceptron learning

 

def sign(x):
    # threshold activation: treats 0 as positive, returning +1
    if x < 0:
        return -1
    else:
        return 1

def perceptron(x, y, lamb):
    # x: training instances (each with X0 = 1 prepended)
    # y: target labels (+1 / -1), lamb: learning rate λ
    w = []
    for i in range(len(x[0])):          # initialize all weights to 0
        w.append(0)
    for epoch in range(6):              # fixed number of passes for this demo
        for i in range(len(x)):
            yhat = 0
            for j in range(len(x[i])):  # weighted sum of the inputs
                yhat += x[i][j] * w[j]
            err = y[i] - sign(yhat)     # error e = y - ŷ
            for j in range(len(x[i])):  # update rule: wj += λ * e * xij
                w[j] += lamb * err * x[i][j]
            print(x[i], w)              # weights after processing this instance
        print(epoch + 1, w)             # weights at the end of this epoch
        
x = [[1, 1, 0, 0],
     [1, 1, 0, 1],
     [1, 1, 1, 0],
     [1, 1, 1, 1],
     [1, 0, 0, 1],
     [1, 0, 1, 0],
     [1, 0, 1, 1],
     [1, 0, 0, 0]]

y = [-1, 1, 1, 1, -1, -1, 1, -1]

perceptron(x,y,0.1)

[1, 1, 0, 0] [-0.2, -0.2, 0.0, 0.0]
[1, 1, 0, 1] [0.0, 0.0, 0.0, 0.2]
[1, 1, 1, 0] [0.0, 0.0, 0.0, 0.2]
[1, 1, 1, 1] [0.0, 0.0, 0.0, 0.2]
[1, 0, 0, 1] [-0.2, 0.0, 0.0, 0.0]
[1, 0, 1, 0] [-0.2, 0.0, 0.0, 0.0]
[1, 0, 1, 1] [0.0, 0.0, 0.2, 0.2]
[1, 0, 0, 0] [-0.2, 0.0, 0.2, 0.2]
1 [-0.2, 0.0, 0.2, 0.2]
[1, 1, 0, 0] [-0.2, 0.0, 0.2, 0.2]
[1, 1, 0, 1] [-0.2, 0.0, 0.2, 0.2]
[1, 1, 1, 0] [-0.2, 0.0, 0.2, 0.2]
[1, 1, 1, 1] [-0.2, 0.0, 0.2, 0.2]
[1, 0, 0, 1] [-0.4, 0.0, 0.2, 0.0]
[1, 0, 1, 0] [-0.4, 0.0, 0.2, 0.0]
[1, 0, 1, 1] [-0.2, 0.0, 0.4, 0.2]
[1, 0, 0, 0] [-0.2, 0.0, 0.4, 0.2]
2 [-0.2, 0.0, 0.4, 0.2]
[1, 1, 0, 0] [-0.2, 0.0, 0.4, 0.2]
[1, 1, 0, 1] [-0.2, 0.0, 0.4, 0.2]
[1, 1, 1, 0] [-0.2, 0.0, 0.4, 0.2]
[1, 1, 1, 1] [-0.2, 0.0, 0.4, 0.2]
[1, 0, 0, 1] [-0.4, 0.0, 0.4, 0.0]
[1, 0, 1, 0] [-0.6, 0.0, 0.2, 0.0]
[1, 0, 1, 1] [-0.4, 0.0, 0.4, 0.2]
[1, 0, 0, 0] [-0.4, 0.0, 0.4, 0.2]
3 [-0.4, 0.0, 0.4, 0.2]
[1, 1, 0, 0] [-0.4, 0.0, 0.4, 0.2]
[1, 1, 0, 1] [-0.2, 0.2, 0.4, 0.4]
[1, 1, 1, 0] [-0.2, 0.2, 0.4, 0.4]
[1, 1, 1, 1] [-0.2, 0.2, 0.4, 0.4]
[1, 0, 0, 1] [-0.4, 0.2, 0.4, 0.2]
[1, 0, 1, 0] [-0.4, 0.2, 0.4, 0.2]
[1, 0, 1, 1] [-0.4, 0.2, 0.4, 0.2]
[1, 0, 0, 0] [-0.4, 0.2, 0.4, 0.2]
4 [-0.4, 0.2, 0.4, 0.2]
[1, 1, 0, 0] [-0.4, 0.2, 0.4, 0.2]
[1, 1, 0, 1] [-0.2, 0.4, 0.4, 0.4]
[1, 1, 1, 0] [-0.2, 0.4, 0.4, 0.4]
[1, 1, 1, 1] [-0.2, 0.4, 0.4, 0.4]
[1, 0, 0, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 1, 0] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 1, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 0, 0] [-0.4, 0.4, 0.4, 0.2]
5 [-0.4, 0.4, 0.4, 0.2]
[1, 1, 0, 0] [-0.4, 0.4, 0.4, 0.2]
[1, 1, 0, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 1, 1, 0] [-0.4, 0.4, 0.4, 0.2]
[1, 1, 1, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 0, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 1, 0] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 1, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 0, 0] [-0.4, 0.4, 0.4, 0.2]
6 [-0.4, 0.4, 0.4, 0.2]

 


Graphical interpretation of the process

Since f(w,x) is a linear combination of input variables, the decision boundary is linear.

For nonlinearly separable problems, the perceptron learning algorithm will fail because no linear hyperplane can separate the classes perfectly.

 

Classic example of non-linearly separable data
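The classic case is the XOR function, which is assumed to be what the figure here illustrates: the two instances where exactly one input is 1 form one class, and no single line separates them from the other two instances. A minimal sketch showing that the perceptron update rule never settles on XOR data:

def sign(v):
    return 1 if v >= 0 else -1

# XOR data with X0 = 1 prepended, as before; class +1 when exactly one input is 1
x = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [-1, 1, 1, -1]

w = [0, 0, 0]
for epoch in range(100):
    mistakes = 0
    for xi, yi in zip(x, y):
        yhat = sign(sum(wj * xij for wj, xij in zip(w, xi)))
        if yhat != yi:
            mistakes += 1
            w = [wj + 0.1 * (yi - yhat) * xij for wj, xij in zip(w, xi)]
print(mistakes, w)   # mistakes never reaches 0: the weights cycle instead of converging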

 

Let's back up a little. We didn't really discuss this approach to classification: establish a demarcation line between groups, or clusters, of instances and represent it as an equation.

+ = Iris versicolor
x = Iris setosa

Of course this concept can be extended to higher dimensions, thinking of the line as a hyperplane.

The linear equation representing this line is f() = 2.0 - 0.5 * PetalLength - 0.8 * PetalWidth = 0

if f() >= 0: Iris setosa
else: Iris versicolor

2.0 = bias
0.5 and 0.8 = weights
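A small sketch of applying that decision rule (the bias 2.0 and the weights 0.5 and 0.8 are the values given above; the petal measurements are made-up illustrations):

def classify_iris(petal_length, petal_width):
    # f() = 2.0 - 0.5 * PetalLength - 0.8 * PetalWidth, from the equation above
    f = 2.0 - 0.5 * petal_length - 0.8 * petal_width
    return "Iris setosa" if f >= 0 else "Iris versicolor"

print(classify_iris(1.4, 0.2))   # f = 1.14 >= 0  -> Iris setosa
print(classify_iris(4.5, 1.5))   # f = -1.45 < 0  -> Iris versicolor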

Task: Find values for the weights and the bias, so that training data is correctly classified by line (equation).

In general: y = w0 + w1a1 + w2a2 + ... + wkak

Where:
y: class
ai, i = 1 ... k: attribute values
wj, j = 0 ... k: weights

Task: Find optimal weights wj to separate the different classes

Weights are calculated from training data

Mathematical technique: Linear optimization

 


Multilayer Neural Network

Perceptrons have no middle (hidden) layers.

Hidden layers: intermediary layers between the input and output layers

More general activation functions (sigmoid, linear, etc.) are typically used.

A multi-layer neural network can solve classification tasks involving nonlinear decision surfaces.

Can we apply the perceptron learning rule to each node, including hidden nodes?

Problem: how do we determine the true value of y for the hidden nodes?

We can approximate the error in the hidden nodes by the error in the output nodes.

But the problem is that it is not obvious how much each hidden node contributes to the output error, so we need a principled way to propagate that error backwards. Gradient descent, described next, provides it.


Gradient Descent for Multilayer NN

Weight update: wj = wj - λ ∂E/∂wj, i.e., move each weight a small step in the direction that decreases the error

Error function: E = ½ Σi (yi - f(w, xi))², the squared error over the training instances

The activation function f must be differentiable, which is the reason the sign function is not used.

For the sigmoid function the derivative is conveniently f'(v) = f(v)(1 - f(v)).

Stochastic gradient descent: update the weights immediately after each training instance, rather than after a full pass over the data.

For output neurons, the weight update formula is the same as before (gradient descent for the perceptron), using the error observed at that output node.

For hidden neurons, the target output is not observed, so the error is propagated backwards from the output layer: with the sigmoid activation, a hidden node j with output oj receives δj = oj(1 - oj) Σk wjk δk, where the sum is over the nodes k in the next layer, and the same weight update is then applied using δj.
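The notes do not include code for this part, so the following is a minimal sketch of stochastic gradient descent with backpropagation on a one-hidden-layer network, using the sigmoid activation and squared error described above. The XOR data, network size, learning rate, and random seed are illustrative choices, not taken from the original.

import math, random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_xor(epochs=10000, lamb=0.5, hidden=2, seed=1):
    # inputs include X0 = 1 as a bias input, as in the perceptron code earlier
    x = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
    y = [0, 1, 1, 0]                               # XOR targets, 0/1 for a sigmoid output
    random.seed(seed)
    wh = [[random.uniform(-1, 1) for _ in range(len(x[0]))] for _ in range(hidden)]
    wo = [random.uniform(-1, 1) for _ in range(hidden + 1)]   # +1 for the hidden-layer bias

    for _ in range(epochs):
        for xi, yi in zip(x, y):                   # stochastic: update after each instance
            # forward pass
            h = [1] + [sigmoid(sum(w * v for w, v in zip(wh[j], xi))) for j in range(hidden)]
            o = sigmoid(sum(w * v for w, v in zip(wo, h)))
            # backward pass: output delta, then hidden deltas through the output weights
            delta_o = (yi - o) * o * (1 - o)
            delta_h = [h[j + 1] * (1 - h[j + 1]) * wo[j + 1] * delta_o for j in range(hidden)]
            # gradient descent weight updates, learning rate lamb
            wo = [w + lamb * delta_o * v for w, v in zip(wo, h)]
            for j in range(hidden):
                wh[j] = [w + lamb * delta_h[j] * v for w, v in zip(wh[j], xi)]
    return wh, wo

wh, wo = train_xor()
for xi in [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]:
    h = [1] + [sigmoid(sum(w * v for w, v in zip(wh[j], xi))) for j in range(len(wh))]
    # outputs should move toward 0, 1, 1, 0 (a different seed or more epochs may be needed)
    print(xi[1:], round(sigmoid(sum(w * v for w, v in zip(wo, h))), 2))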


Design Issues in ANN

Number of nodes in input layer

Number of nodes in output layer

Number of nodes in hidden layer

Initial weights and biases


Characteristics of ANN

Advantages and disadvantages

Multilayer ANNs are universal approximators but can suffer from overfitting if the network is too large.

Gradient descent may converge to a local minimum.

Model building can be very time-consuming, but testing can be very fast.

Can handle redundant attributes because weights are automatically learned

Sensitive to noise in training data

Difficult to handle missing attributes

Recent developments

Use in deep learning and unsupervised feature learning

Google Brain project