I- Neural Networks: Representation
a. Model representation 1:
Let's examine how we will represent a hypothesis function using neural networks. At a very simple level, neurons are computational units that receive electrical inputs (called "spikes") through their dendrites and channel them to outputs (axons). In our model, the dendrites correspond to the input features $x_1, \dots, x_n$, and the output is the result of our hypothesis function. The $x_0$ input node is sometimes called the "bias unit." It is always equal to 1. In neural networks, we use the same logistic function as in classification,
$$g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}},$$
yet we sometimes call it a sigmoid (logistic) activation function. In this situation, our "theta" parameters are sometimes called "weights".
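As a rough illustration of a single logistic unit, the following sketch computes the sigmoid of a weighted sum of the inputs plus a bias term. The function names `sigmoid` and `neuron` and the example weights are hypothetical, chosen only for this example.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) activation: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(theta, x):
    """Output of one logistic unit.

    theta : weights, with theta[0] multiplying the bias unit x0 = 1
    x     : input features x1..xn (the bias unit is added here)
    """
    inputs = [1.0] + list(x)                      # prepend bias unit x0 = 1
    z = sum(t * xi for t, xi in zip(theta, inputs))
    return sigmoid(z)

# Hypothetical weights: z = 0 + 1*2 - 1*2 = 0, so the output is 0.5.
print(neuron([0.0, 1.0, -1.0], [2.0, 2.0]))
```

Keeping the bias weight as `theta[0]` mirrors the convention in the text that $x_0$ is always 1.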
Visually, a simplistic representation looks like:
$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow [\;\;] \rightarrow h_\theta(x)$$
Our input nodes (layer 1), also known as the "input layer", go into another node (layer 2), known as the "output layer", which finally outputs the hypothesis function.
We can have intermediate layers of nodes between the input and output layers called the "hidden layers."
In this example, we label these intermediate or "hidden" layer nodes $a_0^{(2)}, \dots, a_n^{(2)}$ and call them "activation units."
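If we had one hidden layer, it would look like:
$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} \rightarrow \begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{bmatrix} \rightarrow h_\theta(x)$$
The value of each "activation" node is obtained as follows:
$$\begin{aligned}
a_1^{(2)} &= g(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3) \\
a_2^{(2)} &= g(\theta_{20}^{(1)} x_0 + \theta_{21}^{(1)} x_1 + \theta_{22}^{(1)} x_2 + \theta_{23}^{(1)} x_3) \\
a_3^{(2)} &= g(\theta_{30}^{(1)} x_0 + \theta_{31}^{(1)} x_1 + \theta_{32}^{(1)} x_2 + \theta_{33}^{(1)} x_3) \\
h_\theta(x) = a_1^{(3)} &= g(\theta_{10}^{(2)} a_0^{(2)} + \theta_{11}^{(2)} a_1^{(2)} + \theta_{12}^{(2)} a_2^{(2)} + \theta_{13}^{(2)} a_3^{(2)})
\end{aligned}$$
Here $a_i^{(j)}$ is the activation of unit $i$ in layer $j$, and $\theta^{(j)}$ is the matrix of weights that maps layer $j$ to layer $j+1$.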
This is saying that we compute our activation nodes by using a 3×4 matrix of parameters, $\theta^{(1)}$. We apply each row of the parameters to our inputs to obtain the value of one activation node. Our hypothesis output is the logistic function applied to the weighted sum of the activation nodes, where the weights come from a second parameter matrix $\theta^{(2)}$ for the second layer of nodes.
Each layer gets its own matrix of weights, θ(j).
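The computation described above (three inputs, three hidden units, one output) can be sketched in plain Python. This is a minimal sketch, not a reference implementation: the function name `forward_one_hidden_layer` and the weight values are hypothetical and chosen only so the result is easy to check by hand.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) activation: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward_one_hidden_layer(x, theta1, theta2):
    """Forward pass for 3 inputs, 3 hidden units and 1 output unit.

    x      : list of 3 input features (without the bias unit)
    theta1 : 3x4 weights, one row per hidden activation node
    theta2 : list of 4 weights for the single output node
    """
    a1 = [1.0] + list(x)                       # prepend bias unit x0 = 1
    # Each row of theta1 is applied to the inputs to give one activation.
    a2 = [sigmoid(sum(w * a for w, a in zip(row, a1))) for row in theta1]
    a2 = [1.0] + a2                            # prepend bias unit a0^(2) = 1
    return sigmoid(sum(w * a for w, a in zip(theta2, a2)))

# Hypothetical weights, for illustration only.
theta1 = [[0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]
theta2 = [0.0, 1.0, 1.0, 1.0]

# With all inputs 0, every hidden activation is g(0) = 0.5,
# so the output is g(0.5 + 0.5 + 0.5) = g(1.5).
h = forward_one_hidden_layer([0.0, 0.0, 0.0], theta1, theta2)
print(h)
```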
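The dimensions of these weight matrices are determined as follows: if the network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\theta^{(j)}$ has dimension $s_{j+1} \times (s_j + 1)$.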
The +1 comes from the addition in $\theta^{(j)}$ of the "bias nodes," $x_0$ and $\theta_0^{(j)}$. In other words, the output nodes do not include the bias node, while the inputs do.
[Figure: summary of the model representation.]
Example: if layer 1 has 2 input nodes and layer 2 has 4 activation nodes, then $s_j = 2$ and $s_{j+1} = 4$, so the dimension of $\theta^{(1)}$ is $s_{j+1} \times (s_j + 1) = 4 \times 3$.
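The dimension rule can be checked with a small helper. The function name `weight_shapes` is hypothetical; it takes the number of non-bias units in each layer and returns the shape of each weight matrix.

```python
def weight_shapes(layer_sizes):
    """Return the (rows, cols) shape of each weight matrix.

    layer_sizes lists the number of units per layer, not counting bias units.
    The matrix between layer j and layer j+1 has s_{j+1} rows and s_j + 1
    columns; the extra column multiplies the bias unit.
    """
    return [(s_next, s_cur + 1)
            for s_cur, s_next in zip(layer_sizes, layer_sizes[1:])]

print(weight_shapes([2, 4]))     # the example above: [(4, 3)]
print(weight_shapes([3, 3, 1]))  # the 3-3-1 network: [(3, 4), (1, 4)]
```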
b. Model representation 2:
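To reiterate, the following is an example of a neural network:
$$\begin{aligned}
a_1^{(2)} &= g(\theta_{10}^{(1)} x_0 + \theta_{11}^{(1)} x_1 + \theta_{12}^{(1)} x_2 + \theta_{13}^{(1)} x_3) \\
a_2^{(2)} &= g(\theta_{20}^{(1)} x_0 + \theta_{21}^{(1)} x_1 + \theta_{22}^{(1)} x_2 + \theta_{23}^{(1)} x_3) \\
a_3^{(2)} &= g(\theta_{30}^{(1)} x_0 + \theta_{31}^{(1)} x_1 + \theta_{32}^{(1)} x_2 + \theta_{33}^{(1)} x_3) \\
h_\theta(x) = a_1^{(3)} &= g(\theta_{10}^{(2)} a_0^{(2)} + \theta_{11}^{(2)} a_1^{(2)} + \theta_{12}^{(2)} a_2^{(2)} + \theta_{13}^{(2)} a_3^{(2)})
\end{aligned}$$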
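In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable $z_k^{(j)}$ that stands for the argument of our $g$ function. If we substitute $z$ for those arguments in the previous example, we get:
$$a_1^{(2)} = g(z_1^{(2)}), \qquad a_2^{(2)} = g(z_2^{(2)}), \qquad a_3^{(2)} = g(z_3^{(2)})$$
In other words, for layer $j = 2$ and node $k$, the variable $z$ is:
$$z_k^{(2)} = \theta_{k,0}^{(1)} x_0 + \theta_{k,1}^{(1)} x_1 + \cdots + \theta_{k,n}^{(1)} x_n$$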
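The vector representations of $x$ and $z^{(j)}$ are:
$$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad z^{(j)} = \begin{bmatrix} z_1^{(j)} \\ z_2^{(j)} \\ \vdots \\ z_{s_j}^{(j)} \end{bmatrix}$$
Setting $x = a^{(1)}$, we can rewrite the equation as:
$$z^{(j)} = \theta^{(j-1)} a^{(j-1)}$$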
We are multiplying our matrix $\theta^{(j-1)}$, with dimensions $s_j \times (n+1)$ (where $s_j$ is the number of our activation nodes), by our vector $a^{(j-1)}$ with height $(n+1)$. This gives us our vector $z^{(j)}$ with height $s_j$. Now we can get a vector of our activation nodes for layer $j$ as follows:
$$a^{(j)} = g(z^{(j)})$$
where our function $g$ is applied element-wise to the vector $z^{(j)}$.
We can then add a bias unit to layer $j$ after we have computed $a^{(j)}$. This is element $a_0^{(j)}$, and it is equal to 1. To compute our final hypothesis, let's first compute another $z$ vector:
$$z^{(j+1)} = \theta^{(j)} a^{(j)}$$
We get this final $z$ vector by multiplying the next theta matrix after $\theta^{(j-1)}$ with the values of all the activation nodes we just computed. This last theta matrix $\theta^{(j)}$ has only one row, which is multiplied by the one column $a^{(j)}$, so that our result is a single number. We then get our final result with:
$$h_\theta(x) = a^{(j+1)} = g(z^{(j+1)})$$
Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression. Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.
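The vectorized steps above can be sketched as a loop over the weight matrices. This is a minimal sketch that uses plain Python lists to stay dependency-free; in practice a matrix library would typically handle the product. The names `matvec` and `forward` and the example weights are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) activation: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def matvec(theta, a):
    """Matrix-vector product theta * a, with theta given as a list of rows."""
    return [sum(w * v for w, v in zip(row, a)) for row in theta]

def forward(x, thetas):
    """Propagate x through every layer and return the last activation vector."""
    a = list(x)                           # a^(1) = x, without the bias unit
    for theta in thetas:
        a = [1.0] + a                     # add the bias unit a_0 = 1
        z = matvec(theta, a)              # z^(j+1) = theta^(j) a^(j)
        a = [sigmoid(zk) for zk in z]     # a^(j+1) = g(z^(j+1)), element-wise
    return a

# Hypothetical weights: a 2 x 3 matrix followed by a 1 x 3 matrix.
thetas = [
    [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]],
    [[0.0, 1.0, 1.0]],
]

# Both hidden units see z = 0, so each activation is 0.5 and the output is g(1).
h = forward([0.5, 0.5], thetas)
print(h)
```

Because the last matrix has a single row, the returned list has one element, which plays the role of $h_\theta(x)$.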
c. Applications:
i. Intuition 1:
A simple example of applying neural networks is predicting $x_1$ AND $x_2$, the logical 'and' operator, which is true only if both $x_1$ and $x_2$ are 1.
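The graph of our function will look like:
$$\begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} \rightarrow \begin{bmatrix} g(z^{(2)}) \end{bmatrix} \rightarrow h_\theta(x)$$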
Remember that x0 is our bias variable and is always 1.
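Let's set our first theta matrix as:
$$\theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \end{bmatrix}$$
This will cause the output of our hypothesis to be close to 1 only if both $x_1$ and $x_2$ are 1. In other words:
$$h_\theta(x) = g(-30 + 20 x_1 + 20 x_2)$$
$$\begin{array}{cc|l}
x_1 & x_2 & h_\theta(x) \\ \hline
0 & 0 & g(-30) \approx 0 \\
0 & 1 & g(-10) \approx 0 \\
1 & 0 & g(-10) \approx 0 \\
1 & 1 & g(10) \approx 1
\end{array}$$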
So we have constructed one of the fundamental operations in computers by using a small neural network rather than an actual AND gate. Neural networks can also be used to simulate all the other logical gates. The following is an example of the logical operator 'OR', meaning either $x_1$ is true or $x_2$ is true, or both:
$$h_\theta(x) = g(-10 + 20 x_1 + 20 x_2)$$
where $g(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, which is approximately 0 for large negative $z$ and approximately 1 for large positive $z$.
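The AND and OR units can be checked numerically. The sketch below assumes the weights (-30, 20, 20) for AND and (-10, 20, 20) for OR; treat these values as an assumption for illustration, since any weights with the same sign pattern and a large enough magnitude behave the same way. The helper name `gate` is hypothetical.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) activation: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def gate(theta, x1, x2):
    """Single logistic unit: g(theta0 * 1 + theta1 * x1 + theta2 * x2)."""
    return sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2)

AND_WEIGHTS = [-30.0, 20.0, 20.0]  # assumed weights for logical AND
OR_WEIGHTS = [-10.0, 20.0, 20.0]   # assumed weights for logical OR

# Print the truth table, rounding each output to the nearest of 0 and 1.
for x1 in (0, 1):
    for x2 in (0, 1):
        and_out = round(gate(AND_WEIGHTS, x1, x2))
        or_out = round(gate(OR_WEIGHTS, x1, x2))
        print(x1, x2, "AND:", and_out, "OR:", or_out)
```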
ii. Intuition 2:
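The $\theta^{(1)}$ matrices for AND, NOR, and OR are:
$$\begin{aligned}
\text{AND}: \quad \theta^{(1)} &= \begin{bmatrix} -30 & 20 & 20 \end{bmatrix} \\
\text{NOR}: \quad \theta^{(1)} &= \begin{bmatrix} 10 & -20 & -20 \end{bmatrix} \\
\text{OR}: \quad \theta^{(1)} &= \begin{bmatrix} -10 & 20 & 20 \end{bmatrix}
\end{aligned}$$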
We can combine these to get the XNOR logical operator (which gives 1 if x1 and x2 are both 0 or both 1).
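For the transition between the first and second layer, we'll use a $\theta^{(1)}$ matrix that combines the values for AND and NOR:
$$\theta^{(1)} = \begin{bmatrix} -30 & 20 & 20 \\ 10 & -20 & -20 \end{bmatrix}$$
For the transition between the second and third layer, we'll use a $\theta^{(2)}$ matrix that uses the values for OR:
$$\theta^{(2)} = \begin{bmatrix} -10 & 20 & 20 \end{bmatrix}$$
Let's write out the values for all our nodes:
$$\begin{aligned}
a^{(2)} &= g(\theta^{(1)} x) \\
a^{(3)} &= g(\theta^{(2)} a^{(2)}) \\
h_\theta(x) &= a^{(3)}
\end{aligned}$$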
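And there we have the XNOR operator using a hidden layer with two nodes! The following table summarizes the above algorithm, with each value rounded to 0 or 1:
$$\begin{array}{cc|cc|c}
x_1 & x_2 & a_1^{(2)} & a_2^{(2)} & h_\theta(x) \\ \hline
0 & 0 & 0 & 1 & 1 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
1 & 1 & 1 & 0 & 1
\end{array}$$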
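The two-layer XNOR construction can be sketched as follows. The weight values are assumed to be (-30, 20, 20) for AND, (10, -20, -20) for NOR and (-10, 20, 20) for OR, and the names `layer` and `xnor` are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) activation: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(theta, a):
    """Add the bias unit, then apply one weight matrix and the sigmoid."""
    a = [1.0] + list(a)
    return [sigmoid(sum(w * v for w, v in zip(row, a))) for row in theta]

# Assumed weights (see the lead-in above).
THETA1 = [[-30.0, 20.0, 20.0],    # hidden unit 1: x1 AND x2
          [10.0, -20.0, -20.0]]   # hidden unit 2: x1 NOR x2
THETA2 = [[-10.0, 20.0, 20.0]]    # output unit: a1 OR a2

def xnor(x1, x2):
    """Return 1 if x1 and x2 are equal, else 0."""
    a2 = layer(THETA1, [x1, x2])  # hidden layer activations
    a3 = layer(THETA2, a2)        # output layer activation
    return round(a3[0])

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xnor(x1, x2))
```

The output is 1 exactly when either the AND unit or the NOR unit fires, which happens only when the two inputs agree.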
d. Multiclass Classification:
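To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we wanted to classify our data into one of four categories, for example with an algorithm that takes an image as input and classifies it accordingly. The network then has four output units, so $h_\theta(x) \in \mathbb{R}^4$, and each label $y^{(i)}$ is one of the vectors
$$\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix},$$
where the position of the 1 indicates the class.
[Figure: example network that classifies an input image into one of four categories.]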
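One common convention for turning a vector-valued hypothesis into a class label is to encode each class as a vector with a single 1 and to predict the class whose output component is largest. The sketch below illustrates that convention; the function names `one_hot` and `predict_class` and the sample output vector are hypothetical.

```python
def one_hot(class_index, num_classes):
    """Encode a class as a vector with a single 1 at position class_index."""
    return [1 if k == class_index else 0 for k in range(num_classes)]

def predict_class(h):
    """Return the index of the largest component of the output vector h."""
    return max(range(len(h)), key=lambda k: h[k])

# Hypothetical network output for four categories.
h = [0.05, 0.10, 0.80, 0.02]
predicted = predict_class(h)

print(predicted)              # index of the largest output, here 2
print(one_hot(predicted, 4))  # the matching label vector, here [0, 0, 1, 0]
```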