DM 352 Syllabus | DM 552 Syllabus

last updated 23-Jun-2021

**Artificial Neural Networks** are found primarily in the field of **Artificial Intelligence** (AI)

"The theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages."

What is considered AI has changed over time:

- Computing in the 50s was AI
- High level computing languages in the 60s
- Deduction systems in the 70s
- Early NLP in the 90s and 2000s
- Expert systems
- Language translation, spoken language recognition
- Visual recognition
## AI since 2011

- Deep learning through neural networks (NN)
- Big data providing the basis for modeling systems
- Artificial general intelligence (AGI)

–Robotics

–Speech recognition

–Natural language processing (NLP)

## Current state of AI

- Game playing

–1997 IBM Deep Blue beats Kasparov in chess

–2007 checkers is solved

–2011 IBM Watson beats Rutter and Jennings in Jeopardy (two of the three players in the Jeopardy Greatest of All Time tournament)

- Expert systems

–2019 mammogram analysis by deep learning outperformed doctors (but can’t explain analysis)

- Can outperform but can fail miserably

–Visual systems can be fooled

- Today's NN tools make them easy to build

–TensorFlow

–Many NN libraries built into data science tools

- Variety of NN arrangements possible

–Convolutional NNs show serious promise in image processing

- Smart phones, smart speakers, image recognition commonplace

- Full language translation
## Future of AI

- Tesla and driverless cars

- Smart home devices

- Close to passing the Turing test

–Interaction with a computer is indistinguishable from a human

- Ray Kurzweil predicts AI will pass the Turing test in 2029

- RK also predicts singularity in 2045

– machines exceed human intelligence

Traditional NLP (approach used in the 90s-2000s)

- Communication with humans is the primary characteristic
- Levels of NLP for “understanding” and translation

–**Lexical** – separate the words, choose between homonyms

–**Syntactic** – fit words and phrases into a structure

–**Semantic** – determine meaning in context

Today's NLP approaches use NNs

- Wealth of documents (big data) already translated in the target languages are digitally available.
- Use neural networks (NN) to "learn" the patterns in one language to predict the patterns in another.
- NNs don’t explicitly construct the syntax trees but recognize the common patterns (no understanding)
- Require huge numbers of examples for training.
- BIG DATA
- Speech synthesis (easiest of the steps)
- Understand spoken words as well as written (Alexa, Siri)
- NLP systems must consider all levels of words simultaneously (lexical, syntactic, semantic)
- N-grams: 1-word, 2-, 3-, 4-, … word phrases considered simultaneously
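The n-gram idea above can be sketched in a few lines of Python. This is a minimal illustration (tokenization here is just whitespace splitting; real NLP pipelines do much more):

```python
def ngrams(text, n):
    """Return the list of n-grams (tuples of n consecutive words) in the text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Unigrams, bigrams, and trigrams of the same sentence:
sentence = "the cat sat on the mat"
print(ngrams(sentence, 1))  # [('the',), ('cat',), ('sat',), ...]
print(ngrams(sentence, 2))  # [('the', 'cat'), ('cat', 'sat'), ...]
print(ngrams(sentence, 3))  # [('the', 'cat', 'sat'), ...]
```

An NLP system would consider these phrase lengths simultaneously, counting how often each n-gram appears across a large corpus.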

ANNs mimic the brain and its interconnection of **neurons**. The connection of neurons is through **axons**.

A neuron can be thought of as having multiple connections from the axons of other neurons through its **dendrites**; each connection has a gap, or **synapse**, that controls the strength of the signal passing along the axon.

Based on the input signals from the axons, the neuron can fire a signal to an output axon that other neurons use as input. The axons thus act as the neuron's inputs and outputs.

The human brain has 100 billion neurons. An individual neuron doesn't do that much but the composition of the interconnection and number of neurons allow for complex thinking.

ANNs are able to serve as another form of **classification (supervised)** that can be used on a larger variety of data input.

In an ANN the neuron is simulated with a perceptron or node that essentially takes the inputs and calculates an output based on weights of the inputs.

**Example of a task**: Output Y is 1 if at least two of the three inputs are equal to 1.

We want the network to learn this function. Of course, where we really use this is to learn to classify complicated structures, images, data sets, etc.

We add weights to each of the inputs. Sum the weighted inputs and compare the sum to the threshold t to determine what the node outputs.

The ANN model is an assembly/network of inter-connected nodes and weighted links.

The output node sums each of its input values according to the weights of its links and then compares the sum against some threshold t to determine one of two output values. The neuron "fires" or not.

These two functions are the same: y = 1 if w_{1}X_{1} + ... + w_{d}X_{d} - t >= 0, else y = -1; equivalently, y = sign(w_{0}X_{0} + w_{1}X_{1} + ... + w_{d}X_{d}). The second one defines the weight w_{0} = -t and X_{0} = 1 to streamline the summation.
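A minimal sketch of this equivalence in Python. The weights (0.3, 0.3, 0.3) and threshold t = 0.4 are taken from the worked single-layer example in these notes:

```python
def fire_threshold(w, x, t):
    """First form: fire (output 1) iff the weighted sum of the inputs
    reaches the threshold t, otherwise output -1."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s - t >= 0 else -1

def fire_bias(w, x):
    """Second form: prepend w0 = -t and X0 = 1, then just take the sign
    of the whole weighted sum (sign(v) = -1 if v < 0, else 1)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s >= 0 else -1

# Inputs (1, 0, 1): both forms agree.
w, t, x = [0.3, 0.3, 0.3], 0.4, [1, 0, 1]
print(fire_threshold(w, x, t))       # 1
print(fire_bias([-t] + w, [1] + x))  # 1
```

Folding the threshold into the weight vector is what lets the learning algorithm later adjust the threshold the same way it adjusts every other weight.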

single-layered network (perceptron)

multi-layered network -- there can be many layers at the hidden stage

feed-forward -- output of each layer is connected as inputs to the next layer

recurrent network -- interconnections may cycle back

Each output is binary. If you want to have different classifications, you'll need more outputs to code those possibilities as shown to the right.

Sum the weighted inputs of the node and apply the activation function

Single layer network: Contains only input and output nodes

Activation function: **f = sign(w^{T}X)** -- dot product of weights and inputs, as above

Applying the model is straightforward. With X_{1} = 1, X_{2} = 0, X_{3} = 1 as inputs, the output is computed as

y = sign(0.3 + 0.0 + 0.3 - 0.4) = sign(0.2) = 1
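As a check, these same weights (bias w_{0} = -0.4, w_{1} = w_{2} = w_{3} = 0.3, inferred from the worked sum above) solve the earlier task "Y is 1 if at least two of the three inputs are 1". A short sketch enumerating all eight input combinations:

```python
from itertools import product

def sign(v):
    return -1 if v < 0 else 1

# Weights from the worked example: bias w0 = -0.4 (threshold 0.4),
# w1 = w2 = w3 = 0.3.
w = [-0.4, 0.3, 0.3, 0.3]

# Verify that the model computes "Y = 1 iff at least two inputs are 1"
# on every possible binary input.
for x1, x2, x3 in product([0, 1], repeat=3):
    y = sign(w[0] + w[1] * x1 + w[2] * x2 + w[3] * x3)
    expected = 1 if x1 + x2 + x3 >= 2 else -1
    print((x1, x2, x3), y, y == expected)
```

With two or more inputs on, the sum is at least 0.6 - 0.4 = 0.2 > 0; with at most one, it is at most 0.3 - 0.4 < 0, so the threshold cleanly separates the two cases.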

Initialize the weights (w_{0}, w_{1}, …, w_{d}) at 0 or randomize.

**Repeat**

- For each training example (x_{i}, y_{i})
  - Compute f(w, x_{i}) -- f() is the perceptron's current calculation of y_{i} for x_{i} based on the set of weights
  - Update each weight: w_{j} = w_{j} + λ(y_{i} - f(w, x_{i}))x_{ij}

**Until** stopping condition is met

λ is the learning rate, 0 ≤ λ ≤ 1, and controls the rate of change to the weights

Update the weight based on error:

- If y = f(x,w), e = 0: no update needed
- If y > f(x,w), e = 2: weight must be increased so that f(x,w) will increase towards this instance
- If y < f(x,w), e = -2: weight must be decreased so that f(x,w) will decrease towards this instance

The left table shows the 8 instances with X_{0} = 1.

The middle table shows how the weights are updated after each instance.

The right Epoch table shows the weights after each full pass (epoch) through the instances.

Example calculation details:

Instance 1: f((0,0,0,0), (1,1,0,0)) = sign(0) = 1; since Y = -1 < f(w,x) = 1, **e = (-1) - 1 = -2** and **λe = -0.2**

w0 = 0 + -0.2*1 = -0.2

w1 = 0 + -0.2*1 = -0.2

w2 = 0 + -0.2*0 = 0

w3 = 0 + -0.2*0 = 0

Instance 2: f((-0.2,-0.2,0,0), (1,1,0,1)) = -0.2 + -0.2 + 0 + 0 = -0.4, so sign = -1; since Y = 1 > f(w,x) = -1, **e = 1 - (-1) = 2** and **λe = 0.2**

w0 = -0.2 + 0.2*1 = 0

w1 = -0.2 + 0.2*1 = 0

w2 = 0 + 0.2*0 = 0

w3 = 0 + 0.2*1 = 0.2

Instance 3: f((0,0,0,0.2), (1,1,1,0)) = 0 + 0 + 0 + 0 = 0, so sign = 1; since Y = 1 = f(w,x) = 1, **e = 0** and **λe = 0**

w0 = 0 + 0.0*1 = 0.0

w1 = 0 + 0.0*1 = 0.0

w2 = 0 + 0.0*1 = 0.0

w3 = 0.2 + 0.0*0 = 0.2

And so forth. Below is a Python implementation of the perceptron learning algorithm.

```python
def sign(x):
    if x < 0:
        return -1
    else:
        return 1

def perceptron(x, y, lamb):
    # Initialize the weights at 0
    w = []
    for i in range(len(x[0])):
        w.append(0)
    for epoch in range(6):
        for i in range(len(x)):
            # Compute the perceptron's prediction for instance i
            yhat = 0
            for j in range(len(x[i])):
                yhat += x[i][j] * w[j]
            # Error term e = y - sign(w.x)
            err = y[i] - sign(yhat)
            # Update each weight: w_j += lambda * e * x_j
            for j in range(len(x[i])):
                w[j] += lamb * err * x[i][j]
            print(x[i], w)
        print(epoch + 1, w)
```
Running it on the 8 instances from the tables:

```python
x = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 1, 1, 0], [1, 1, 1, 1],
     [1, 0, 0, 1], [1, 0, 1, 0], [1, 0, 1, 1], [1, 0, 0, 0]]
y = [-1, 1, 1, 1, -1, -1, 1, -1]
perceptron(x, y, 0.1)
```

which prints the instance and weights after each update, with the epoch number and final weights after each pass:

```
[1, 1, 0, 0] [-0.2, -0.2, 0.0, 0.0]
[1, 1, 0, 1] [0.0, 0.0, 0.0, 0.2]
[1, 1, 1, 0] [0.0, 0.0, 0.0, 0.2]
[1, 1, 1, 1] [0.0, 0.0, 0.0, 0.2]
[1, 0, 0, 1] [-0.2, 0.0, 0.0, 0.0]
[1, 0, 1, 0] [-0.2, 0.0, 0.0, 0.0]
[1, 0, 1, 1] [0.0, 0.0, 0.2, 0.2]
[1, 0, 0, 0] [-0.2, 0.0, 0.2, 0.2]
1 [-0.2, 0.0, 0.2, 0.2]
[1, 1, 0, 0] [-0.2, 0.0, 0.2, 0.2]
[1, 1, 0, 1] [-0.2, 0.0, 0.2, 0.2]
[1, 1, 1, 0] [-0.2, 0.0, 0.2, 0.2]
[1, 1, 1, 1] [-0.2, 0.0, 0.2, 0.2]
[1, 0, 0, 1] [-0.4, 0.0, 0.2, 0.0]
[1, 0, 1, 0] [-0.4, 0.0, 0.2, 0.0]
[1, 0, 1, 1] [-0.2, 0.0, 0.4, 0.2]
[1, 0, 0, 0] [-0.2, 0.0, 0.4, 0.2]
2 [-0.2, 0.0, 0.4, 0.2]
[1, 1, 0, 0] [-0.2, 0.0, 0.4, 0.2]
[1, 1, 0, 1] [-0.2, 0.0, 0.4, 0.2]
[1, 1, 1, 0] [-0.2, 0.0, 0.4, 0.2]
[1, 1, 1, 1] [-0.2, 0.0, 0.4, 0.2]
[1, 0, 0, 1] [-0.4, 0.0, 0.4, 0.0]
[1, 0, 1, 0] [-0.6, 0.0, 0.2, 0.0]
[1, 0, 1, 1] [-0.4, 0.0, 0.4, 0.2]
[1, 0, 0, 0] [-0.4, 0.0, 0.4, 0.2]
3 [-0.4, 0.0, 0.4, 0.2]
[1, 1, 0, 0] [-0.4, 0.0, 0.4, 0.2]
[1, 1, 0, 1] [-0.2, 0.2, 0.4, 0.4]
[1, 1, 1, 0] [-0.2, 0.2, 0.4, 0.4]
[1, 1, 1, 1] [-0.2, 0.2, 0.4, 0.4]
[1, 0, 0, 1] [-0.4, 0.2, 0.4, 0.2]
[1, 0, 1, 0] [-0.4, 0.2, 0.4, 0.2]
[1, 0, 1, 1] [-0.4, 0.2, 0.4, 0.2]
[1, 0, 0, 0] [-0.4, 0.2, 0.4, 0.2]
4 [-0.4, 0.2, 0.4, 0.2]
[1, 1, 0, 0] [-0.4, 0.2, 0.4, 0.2]
[1, 1, 0, 1] [-0.2, 0.4, 0.4, 0.4]
[1, 1, 1, 0] [-0.2, 0.4, 0.4, 0.4]
[1, 1, 1, 1] [-0.2, 0.4, 0.4, 0.4]
[1, 0, 0, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 1, 0] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 1, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 0, 0] [-0.4, 0.4, 0.4, 0.2]
5 [-0.4, 0.4, 0.4, 0.2]
[1, 1, 0, 0] [-0.4, 0.4, 0.4, 0.2]
[1, 1, 0, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 1, 1, 0] [-0.4, 0.4, 0.4, 0.2]
[1, 1, 1, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 0, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 1, 0] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 1, 1] [-0.4, 0.4, 0.4, 0.2]
[1, 0, 0, 0] [-0.4, 0.4, 0.4, 0.2]
6 [-0.4, 0.4, 0.4, 0.2]
```

By epoch 6 no instance causes an update: the weights [-0.4, 0.4, 0.4, 0.2] classify all 8 instances correctly.

Since f(w,x) is a linear combination of input variables, the decision boundary is linear.

For nonlinearly separable problems, the perceptron learning algorithm will fail because no linear hyperplane can separate the data perfectly.
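A quick demonstration of this failure, reusing the same update rule on XOR (the classic nonlinearly separable function). Counting misclassifications per epoch shows the error count never reaches zero, no matter how long we train:

```python
def sign(v):
    return -1 if v < 0 else 1

def train_count_errors(x, y, lamb, epochs):
    """Run the perceptron learning rule and return the number of
    misclassified instances in each epoch."""
    w = [0.0] * len(x[0])
    errors_per_epoch = []
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(x, y):
            yhat = sign(sum(wj * xj for wj, xj in zip(w, xi)))
            e = yi - yhat
            if e != 0:
                errors += 1
            for j in range(len(w)):
                w[j] += lamb * e * xi[j]
        errors_per_epoch.append(errors)
    return errors_per_epoch

# XOR with a bias input x0 = 1: y = 1 iff exactly one of x1, x2 is 1.
x = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [-1, 1, 1, -1]
errs = train_count_errors(x, y, 0.1, 50)
print(errs)  # never reaches 0: some instance is always misclassified
```

Zero errors in an epoch would mean a separating line exists, which is impossible for XOR, so the weights keep oscillating forever.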

Let's back up a little. We didn't really talk about this approach to classification: establish a demarcation line between groups, or clusters, of instances and represent it as an equation.

+ = Iris versicolor

x = Iris setosa

Of course this concept can be extended to higher dimension, thinking of the line as a hyperplane.

The linear equation representing this line is

f() = 2.0 – 0.5 * PetalLength – 0.8 * PetalWidth = 0

if f() >= 0: Iris setosa

else: Iris versicolor

2.0 = bias

0.5 and 0.8 = weights

Task: Find values for the weights and the bias, so that training data is correctly classified by line (equation).
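A minimal sketch of applying this decision function. The two sample measurements below are illustrative values chosen for this example, not data from the notes:

```python
def f(petal_length, petal_width):
    """Linear decision function from the notes:
    f = 2.0 - 0.5*PetalLength - 0.8*PetalWidth."""
    return 2.0 - 0.5 * petal_length - 0.8 * petal_width

def classify(petal_length, petal_width):
    # f >= 0 falls on the setosa side of the line, f < 0 on the versicolor side
    return "Iris setosa" if f(petal_length, petal_width) >= 0 else "Iris versicolor"

# Illustrative measurements (typical magnitudes for each species):
print(classify(1.4, 0.2))  # "Iris setosa":     f = 2.0 - 0.70 - 0.16 =  1.14
print(classify(4.5, 1.5))  # "Iris versicolor": f = 2.0 - 2.25 - 1.20 = -1.45
```

The sign of f, not its magnitude, decides the class; the magnitude just measures how far the point sits from the boundary line.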

In general:

y = w_{0} + w_{1}a_{1} + w_{2}a_{2} + ... + w_{k}a_{k}

Where:

- y: class
- a_{i}, i = 1 ... k: attribute values
- w_{j}, j = 0 ... k: weights

Task: Find optimal weights w_{j} to separate the different classes. Weights are calculated from training data.

Mathematical technique: Linear optimization

Perceptrons have no middle, hidden layers.

Hidden layers: intermediary layers between input & output layers

More general activation functions (sigmoid, linear, etc.) are typically used.

Multi-layer neural networks can solve any type of classification task involving nonlinear decision surfaces.

Can we apply perceptron learning rule to each node, including hidden nodes?

- Perceptron learning rule computes error term e = y-f(w,x) and updates weights accordingly

Problem: how to determine the true value of y for hidden nodes?

Approximate error in hidden nodes by error in the output nodes

But the problem is:

- Not clear how adjustments in the hidden nodes affect the overall error
- No guarantee of convergence to an optimal solution

Weight update (gradient descent): w_{j} = w_{j} - λ ∂E/∂w_{j}

Error function (squared error over the training set): E(w) = ½ Σ_{i} (y_{i} - f(w, x_{i}))²

Activation function f must be differentiable, which is why the sign function is not used.

For the sigmoid function f(v) = 1 / (1 + e^{-v}), the derivative is f'(v) = f(v)(1 - f(v)).

Stochastic gradient descent (update the weights immediately after each training example)

For output neurons, weight update formula is the same as before (gradient descent for perceptron)

For hidden neurons, there is no true target value, so the error term is propagated back from the next layer: δ_{j} = f'(net_{j}) Σ_{k} δ_{k}w_{jk}, where k ranges over the neurons that node j feeds.
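The output-neuron and hidden-neuron updates can be sketched as one stochastic-gradient step through a tiny network (one hidden neuron feeding one output neuron). The inputs, weights, target, and learning rate here are arbitrary illustrative values:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x = [1.0, 0.5]    # inputs (including a bias input x0 = 1)
w_h = [0.1, -0.2] # input -> hidden weights
w_o = 0.3         # hidden -> output weight
y = 1.0           # target
lam = 0.5         # learning rate

# Forward pass
net_h = sum(wi * xi for wi, xi in zip(w_h, x))
h = sigmoid(net_h)
o = sigmoid(w_o * h)

# Output neuron: delta uses the error (y - o) directly, scaled by the
# sigmoid derivative f' = f(1 - f)
delta_o = (y - o) * o * (1 - o)

# Hidden neuron: no true target, so its delta is the output delta
# propagated back through the connecting weight
delta_h = h * (1 - h) * delta_o * w_o

# Stochastic gradient descent: update the weights immediately
w_o += lam * delta_o * h
w_h = [wi + lam * delta_h * xi for wi, xi in zip(w_h, x)]
print(w_o, w_h)
```

Since the target y exceeds the output o here, both deltas are positive and the weights move to push the output upward on this instance.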

Number of nodes in input layer

- One input node per binary/continuous attribute
- k or log_{2}k nodes for each categorical attribute with k values

Number of nodes in output layer

- One output for binary class problem
- k or log_{2}k nodes for a k-class problem

Number of nodes in hidden layer

Initial weights and biases
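The k vs. log_{2}k choice for a categorical attribute can be sketched as two small encoders. The color attribute below is hypothetical, just to make the two encodings concrete:

```python
import math

def one_hot(value, categories):
    """k input nodes for a categorical attribute with k values:
    exactly one node is 1, the rest are 0."""
    return [1 if c == value else 0 for c in categories]

def binary_code(value, categories):
    """ceil(log2 k) input nodes: encode the category's index in binary."""
    bits = max(1, math.ceil(math.log2(len(categories))))
    idx = categories.index(value)
    return [(idx >> b) & 1 for b in reversed(range(bits))]

colors = ["red", "green", "blue", "yellow"]  # hypothetical attribute, k = 4
print(one_hot("blue", colors))      # [0, 0, 1, 0]
print(binary_code("blue", colors))  # [1, 0]  (index 2 in 2 bits)
```

One-hot keeps the categories independent at the cost of k nodes; the binary code is more compact but imposes an artificial similarity between categories that share bits.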

Multilayer ANNs are universal approximators but can suffer from overfitting if the network is too large

Gradient descent may converge to local minimum

Model building can be very time consuming, but testing can be very fast

Can handle redundant attributes because weights are automatically learned

Sensitive to noise in training data

Difficult to handle missing attributes

Used in deep learning and unsupervised feature learning

- Seeks to automatically learn a good representation of the input from unlabeled data

Google Brain project

- Learned the concept of a ‘cat’ by looking at unlabeled pictures from YouTube
- A one-billion-connection network