CS9151 MLT
UNIT – II LINEAR MODELS
MACET

UNIT II LINEAR MODELS
"ulti#layer $erceptron % &oing Forwards % &oing 'ackwards: 'ack $ropagation (rror % "ulti#layer $erceptron in $ractice % (xamples of using the ")$ % *verview % +eriving 'ack#$ropagation % adial 'asis Functions and Splines % -oncepts % 'F etwork % -urse of +imensionality % /nterpolations and 'asis Functions % Support 0ector "achines MULTI-LAYER PERCEPTRON The multil multilaye ayerr percep perceptro tron n is an artifi artificia ciall neural neural networ network k struc structur turee and is a nonpar nonparame ametri tricc estimator that can be used for classification and regression. We We discuss the back propagation algorithm to train a multilayer perceptron for a variety of applications. We have pretty much decided that the learning in the neural network happens in the weights. So, to perform more computation it seems sensible to add more weights. There are two things that we can do: add some backwards connections, so that the output neurons connect to the inputs again, or add more neurons. The first approach leads into recurrent networks. These have been studied, but are not that commonly used.We will instead consider the second approach. We can add neurons between the input nodes and the outputs, and this will make more complex neural networks, such as the one shown in Figure .!.
We will think about why adding extra layers of nodes makes a neural network more powerful when we discuss the hidden layer, but for now, to persuade ourselves that it is true, we can check that a prepared network can solve the two-dimensional XOR problem, something that we have seen is not possible for a linear model like the Perceptron. A suitable network is shown in Figure .2. To check that it gives the correct answers, all that is required is to put in each input and work through the network, treating it as two different Perceptrons, first computing the activations of the neurons in the middle layer (labelled as C and D in Figure .2) and then using those activations as the inputs to the single neuron at the output. As an example, I'll work out what happens when you put in (1, 0) as an input; the job of checking the rest is up to you.
Input (1, 0) corresponds to node A being 1 and B being 0. The input to neuron C is therefore −1 × 0.5 + 1 × 1 + 0 × 1 = −0.5 + 1 = 0.5. This is above the threshold of 0, and so neuron C fires, giving output 1. For neuron D the input is −1 × 1 + 1 × 1 + 0 × 1 = −1 + 1 = 0, and so it does not fire, giving output 0. Therefore the input to neuron E is −1 × 0.5 + 1 × 1 + 0 × −1 = 0.5, so neuron E fires. Checking the result of the other inputs should persuade you that neuron E fires when inputs A and B are different to each other, but does not fire when they are the same, which is exactly the XOR function (it doesn't matter that the fire and not-fire cases have been reversed).
So far, so good. Since this network can solve a problem that the Perceptron cannot, it seems worth looking into further. However, now we've got a much more interesting problem to solve, namely how can we train this network so that the weights are adapted to generate the correct (target) answers? If we try the method that we used for the Perceptron we need to compute the error at the output. That's fine, since we know the targets there, so we can compute the difference between the targets and the outputs. But now we don't know which weights were wrong: those in the first layer, or the second? Worse, we don't know what the correct activations are for the neurons in the middle of the network. This fact gives the neurons in the middle of the network their name: they are called the hidden layer (or layers), because it isn't possible to examine and correct their values directly.
Artificial neural networks have many interconnected neurons, each with its own input, output, and activation function.
"ultilayer perceptrons have one or more layers between the input nodes and the eventual output nodes. The 2* example has a middle layer, called a hidden layer, between the input and the output 6see Figure =#>7. 3lthough you and / know what the outputs of an 2* gate would be 6/8ve ;ust outlined them in the table7, and we could define the middle layer ourselves, a truly automated learning platform would take some time.
!OIN! "OR#ARDS ?ust as it did for the $erceptron, training the MLP consists of t$o parts: working out what the outputs are for the given inputs and the current weights, and then updating the weights according to the error, which is a function of the difference between the outputs and the targets. These are generally known as going forwards and backwards through the network.
We've already seen how to go forwards for the MLP when we saw the XOR example above, which was effectively the recall phase of the algorithm. It is pretty much just the same as the Perceptron, except that we have to do it twice, once for each set of neurons, and we need to do it layer by layer, because otherwise the input values to the second layer don't exist. In fact, having made an MLP with two layers of nodes, there is no reason why we can't make one with 3, or 4, or 20 layers of nodes.
Biases

We need to include a bias input to each neuron. We do this in the same way as we did for the Perceptron in Section 3.3.2, by having an extra input that is permanently set to −1, and adjusting the weights to each neuron as part of the training.
GOING BACKWARDS: BACK-PROPAGATION OF ERROR

Computing the errors at the output is no more difficult than it was for the Perceptron, but working out what to do with those errors is more difficult. The method that we are going to look at is called back-propagation of error, which makes it clear that the errors are sent backwards through the network. It is a form of gradient descent.

The best way to describe back-propagation properly is mathematically, but this can be intimidating and difficult to get a handle on at first. There are actually just three things that you need to know, all of which are from differential calculus: the derivative of ½x² (which is x), the fact that if you differentiate a function of x with respect to some other variable t, then the answer is 0, and the chain rule, which tells you how to differentiate composite functions.

When we talked about the Perceptron, we changed the weights so that the neurons fired when the targets said they should, and didn't fire when the targets said they shouldn't. What we did was to choose an error function for each neuron k: E_k = y_k − t_k, and tried to make it as small as possible. We still want to do the same thing (minimise the error, so that neurons fire only when they should), but with the addition of extra layers of weights, this is harder to arrange. The problem is that when we try to adapt the weights of the Multi-layer Perceptron, we have to work out which weights caused the error. This could be the weights connecting the inputs to the hidden layer, or the weights connecting the hidden layer to the output layer.
The error function that we used for the Perceptron was E = Σ_{k=1}^{N} (y_k − t_k), where N is the number of output nodes. However, suppose that we make two errors. In the first, the target is bigger than the output, while in the second the output is bigger than the target. If these two errors are the same size, then when we add them up they cancel out, and the total error says the network is perfect even though it is wrong; so we need to make all of the errors have the same sign. We can do this in a few different ways, but the one that will turn out to be best is the sum-of-squares error function, which calculates the difference between y and t for each node, squares them, and adds them all together:

E(t, y) = ½ Σ_{k=1}^{N} (y_k − t_k)²
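The cancellation problem, and the sum-of-squares fix, can be seen numerically. This is an illustrative sketch (the variable names are mine):

```python
import numpy as np

y = np.array([0.8, 0.2])   # network outputs
t = np.array([0.2, 0.8])   # targets: one error in each direction

plain_error = np.sum(y - t)             # the two errors cancel out
sos_error = 0.5 * np.sum((y - t) ** 2)  # sum-of-squares keeps both

print(plain_error, sos_error)
```

The plain sum is 0 even though both outputs are badly wrong, while the sum-of-squares error reports a sensible positive value.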
Imagine a ball rolling around on a surface that looks like the line in Figure .3. Gravity will make the ball roll downhill (follow the downhill gradient) until it ends up in the bottom of one of the hollows. These are places where the error is small, so that is exactly what we want. This is why the algorithm is called gradient descent.
Having mentioned the activation function, this is a good time to point out a little problem with the threshold function that we have been using for our neurons so far, which is that it is discontinuous (see Figure .; it has a sudden jump in the middle) and so differentiating it at that point isn't possible. The problem is that we need that jump between firing and not firing to make it act like a neuron.
We can solve the problem if we can find an activation function that looks like a threshold function, but is differentiable so that we can compute the gradient. If you squint at a graph of the threshold function then it looks kind of S-shaped. There is a mathematical form of S-shaped functions, called sigmoid functions (see Figure .5).

The most commonly used form of this function (where β is some positive parameter) is:

a = g(h) = 1 / (1 + exp(−βh))

In some texts you will see the activation function given a different form, as:

a = g(h) = tanh(h) = (exp(h) − exp(−h)) / (exp(h) + exp(−h))
which is the hyperbolic tangent function. This is a different but similar function; it is still a sigmoid function, but it saturates (reaches its constant values) at ±1 instead of 0 and 1, which is sometimes useful.

As an algorithm goes, we've fed our inputs forward through the network and worked out which nodes are firing. Now, at the output, we've computed the errors as the sum-squared difference between the outputs and the targets (Equation (.1) above). What we want to do next is to compute the gradient of these errors and use them to decide how much to update each weight in the network. We will do that first for the nodes connected to the output layer, and after we have updated those, we will work backwards through the network until we get back to the inputs again. There are just two problems:

- for the output neurons, we don't know the inputs
- for the hidden neurons, we don't know the targets; for extra hidden layers, we know neither the inputs nor the targets, but even this won't matter for the algorithm we derive

So we can compute the error at the output, but since we don't know what the inputs were that caused it, we can't update those second-layer weights the way we did for the Perceptron. If we use the chain rule of differentiation that you all (possibly) remember from high school then we can get around this problem. The chain rule tells us that if we want to know how the error changes as we vary the weights, we can think about how the error changes as we vary the inputs to the weights, and multiply this by how those input values change as we vary the weights. This is useful because it lets us calculate all of the derivatives that we want to: we can write the activations of the output nodes in terms of the activations of the hidden nodes and the output weights, and then we can send the error calculations back through the network to the hidden layer to decide what the target outputs were for those neurons. This gives two different update functions, one for each of the sets of weights, and we just apply these backwards through the network, starting at the outputs and ending up back at the inputs.
The "ulti#layer $erceptron 3lgorithm We will assume that there are L input nodes, plus the bias, M hidden nodes, also plus a bias, and N output nodes, so that there are 6 L!7G M weights between the input and the hidden layer and 6 M !7G N between the hidden layer and the output. The sums that we write will start from 9 if they include the bias nodes and ! otherwise, and run up to L,M , or N , so that x 0 A B! is the bias input, and a0 A B! is the bias hidden node. 1ere is a 5uick summary of how the algorithm works, and then the full ")$ training algorithm using back#propagation of error is described. !. an input vector is put into the input nodes 4. the inputs are fed forward through the network 6Figure .H7 I the inputs and the first#layer weights 6here labelled as v7 are used to decide whether the hidden nodes fire or not. The activation function g 6J7 is the sigmoid function given in (5uation 6.47 above I the outputs of these neurons and the second#layer weights 6labelled as w7 are used to decide if the output neurons fire or not @. the error is computed as the sum#of#s5uares difference between the network outputs and the targets . this error is fed backwards through the network in order to I first update the second#layer weights I and then afterwards, the first#layer weights
THE MULTI-LAYER PERCEPTRON IN PRACTICE
We are going to look more at choices that can be made about the network in order to use it for solving real problems. We will then apply these ideas to using the MLP to find solutions to four different types of problem: regression, classification, time-series prediction, and data compression.
Amount of Training Data

For the MLP with one hidden layer there are (L+1)×M + (M+1)×N weights, where L, M, N are the number of nodes in the input, hidden, and output layers, respectively. The extra 1s come from the bias nodes, which also have adjustable weights. This is a potentially huge number of adjustable parameters that we need to set during the training phase. The more training data there is, the better for learning, although the time that the algorithm takes to learn increases. Unfortunately, there is no way to compute what the minimum amount of data required is, since it depends on the problem. A rule of thumb that has been around for almost as long as the MLP itself is that you should use a number of training examples that is at least 10 times the number of weights.

Number of Hidden Layers

There are two other considerations concerning the number of weights that are inherent in the calculation above, which is the choice of the number of hidden nodes, and the number of hidden layers. Making these choices is obviously fundamental to the successful application of the algorithm. We will shortly see a pictorial demonstration of the fact that two hidden layers is the most that you ever need for normal MLP learning. In fact, this result can be strengthened: it is possible to show mathematically that one hidden layer with lots of hidden nodes is sufficient. This is known as the Universal Approximation Theorem. The basic idea is that by combining sigmoid functions we can generate ridge-like functions, and by combining ridge-like functions we can generate functions with a unique maximum. By combining these and transforming them using another layer of neurons, we obtain a localised response (a 'bump' function), and any functional mapping can be approximated to arbitrary accuracy using a linear combination of such bumps.

Two hidden layers are sufficient to compute these bump functions for different inputs, and so if the function that we want to learn (approximate) is continuous, the network can compute it.
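The weight count and the ten-times rule of thumb above are simple arithmetic; as a quick illustration (the helper name is my own):

```python
def mlp_weights(L, M, N):
    # (L+1)*M first-layer weights plus (M+1)*N second-layer weights,
    # the extra 1s being the bias nodes
    return (L + 1) * M + (M + 1) * N

# e.g. 4 inputs, 5 hidden nodes, 3 outputs
w = mlp_weights(4, 5, 3)
print(w, "weights ->", 10 * w, "training examples suggested")
```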
!"en to #top Learning The training of the ")$ re5uires that the algorithm runs over the entire dataset many times, with the weights changing as the network makes errors in each iteration. The 5uestion is how to decide when to stop learning, and this is a 5uestion that we are now ready to answer. 3t some stage the error on the validation set will start increasing again, because the network has stopped learning about the function that generated the data, and started to learn about the noise that is in the data itself 6shown in Figure .!!7. 3t this stage we stop the training. This techni5ue is called earl stoppin* & E0AMPLES O" USIN! T,E MLP
The ")$ is rather too complicated to enable us to work through the weight changes as we did with the $erceptron. we shall look at the four types of problems that are generally solved using an ")$: regression, classification, time#series prediction, and data co mpressionMdata denoising. A $egression %rob&em The regression problem we will look at is a very simple one. We will take a set of samples generated by a simple mathematical function, and try to learn the generating function 6that describes how the data was made7 so that we can find the values of any inputs, not ;ust the ones we have training data for. 1;
The function that we will use is a very simple one, just a bit of a sine wave. We'll make the data in the following way (make sure that you have NumPy imported as np first):
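The original listing did not survive extraction; a plausible stand-in (the choice of 40 points, the added cosine term, and the noise level 0.2 are my guesses) is:

```python
import numpy as np

# 40 evenly spaced points on [0, 1], reshaped into a 40x1 column
x = np.linspace(0, 1, 40).reshape((40, 1))
# a bit of a sine wave, with some Gaussian noise added to the targets
t = np.sin(2 * np.pi * x) + np.cos(4 * np.pi * x) + np.random.randn(40, 1) * 0.2
```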
The reason why we have to use the reshape() method is that NumPy defaults to flat one-dimensional arrays for data that should be N×1; compare the results of the np.shape() calls below, and the effect of the transpose operator .T on the array:
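A small demonstration of my own makes the difference clear:

```python
import numpy as np

a = np.linspace(0, 1, 5)
print(np.shape(a))      # a flat array, not a column
print(np.shape(a.T))    # .T does nothing to a 1-D array
b = a.reshape((5, 1))
print(np.shape(b))      # now a proper N x 1 column
print(np.shape(b.T))    # and .T transposes it to 1 x N
```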
Before getting started, we need to normalise the data using the method shown in Section 3.4.5, and then separate the data into training, testing, and validation sets. For this example there are only 40 datapoints, and we'll use half of them as the training set, although that isn't very many and might not be enough for the algorithm to learn effectively. We can split the data in the ratio 50:25:25 by using the odd-numbered elements as training data, the even-numbered ones that do not divide by 4 for testing, and the rest for validation:
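The split described can be written with slice indexing. A sketch of mine (note that Python indexing is zero-based, so "odd-numbered" here means every second element starting from the first):

```python
import numpy as np

x = np.linspace(0, 1, 40).reshape((40, 1))
t = np.sin(2 * np.pi * x)

# every second element for training; of the rest, alternate test/validation
train, traintarget = x[0::2, :], t[0::2, :]
test, testtarget = x[1::4, :], t[1::4, :]
valid, validtarget = x[3::4, :], t[3::4, :]

print(len(train), len(test), len(valid))   # a 20/10/10, i.e. 50:25:25, split
```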
With that done, it is just a case of making and training the MLP. To start with, we will construct a network with three nodes in the hidden layer, and run it for 101 iterations with a learning rate of 0.25, just to see that it works:

The output from this will look something like:

so we can see that the network is learning, since the error is decreasing. We now need to do two things: work out how many hidden nodes we need, and decide how long to train the network for.

The following table shows the results of doing this, reporting the sum-of-squares validation error, for a few different sizes of network:

Based on these numbers, we would select a network with a small number of hidden nodes, certainly between 2 and 10 (and the smaller the better, in general), since their maximum error is much smaller than a network with just 1 hidden node.

Classification with the MLP

Using the MLP for classification problems is not radically different once the output encoding has been worked out. The inputs are easy: they are just the values of the feature measurements (suitably normalised). There are a couple of choices for the outputs. The first is to use a single linear node for the output, y, and put some thresholds on the activation value of that node. For example, for a four-class problem, we could use class C1 if y ≤ −0.5, class C2 if −0.5 < y ≤ 0, class C3 if 0 < y ≤ 0.5, and class C4 if y > 0.5.
What about an example that is very close to a boundary, say y = 0.5? We arbitrarily guess that it belongs to class C3, but the neural network doesn't give us any information about how close it was to the boundary in the output, so we don't know that this was a difficult example to classify. A more suitable output encoding is called 1-of-N encoding. Once the network has been trained, performing the classification is easy: simply choose the element y_k of the output vector that is the largest (in mathematical notation, pick the y_k for which y_k ≥ y_j for all j; this statement says pick the y_k that is bigger than all other possible values y_j). This generates an unambiguous decision, since it is very unlikely that two output neurons will have identical largest output values. This is known as the hard-max activation function. An alternative is the soft-max function, which we saw in Different Output Activation Functions, and which has the effect of scaling the output of each neuron according to how large it is in comparison to the others, and making the total output sum to 1.

A Classification Example: The Iris Dataset

As an example we are going to look at another dataset from the UCI Machine Learning repository. This one is concerned with classifying examples of three types of iris (flower) by the length and width of the sepals and petals, and is called iris. There are two alternatives. One is to edit the data in a text editor using search and replace, and the other is to use some Python code, such as this function:
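The original function did not survive extraction; one plausible version (my own sketch, which replaces the three UCI class names with 0, 1, 2 and writes a new file) is:

```python
def preprocess_iris(infile, outfile):
    # replace the textual class names in iris.data with numeric labels
    classes = {'Iris-setosa': '0', 'Iris-versicolor': '1', 'Iris-virginica': '2'}
    with open(infile) as fin, open(outfile, 'w') as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue          # the UCI file ends with blank lines
            for name, number in classes.items():
                line = line.replace(name, number)
            fout.write(line + '\n')
```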
You can then load it from the new file using loadtxt(). In the dataset, the last column is the class ID, and the others are the four measurements.

The first few datapoints will then look like:
Time-Series Prediction
There is a common data analysis task known as time-series prediction, where we have a set of data that show how something varies over time, and we want to predict how the data will vary in the future. It is quite a difficult task, but a fairly important one. It is useful in any field where there is data that appears over time, which is to say almost any field. The problem is that even if there is some regularity in the time-series, it can appear over many different scales. For example, there is often seasonal variation: if we plotted average temperature over several years, we would notice that it got hotter in the summer and colder in the winter, but we might not notice if there was an overall upward or downward trend to the summer temperatures, because the summer peaks are spread too far apart in the data.

We can write this as an equation, where we are predicting y using a neural network that is written as a function f(·):

y = x(t + τ) = f(x(t), x(t − τ), ..., x(t − (k − 1)τ))

where the two questions about how many datapoints and how far apart they should be come down to choices about τ and k.

The target data for training the neural network is simple, because it comes from further up the time-series, and so training is easy. Suppose that τ = 2 and k = 3. Then the first input data are elements 1, 3, 5 of the dataset, and the target is element 7. The next input vector is elements 2, 4, 6, with target 8, and then 3, 5, 7 with target 9. You train the network by passing through the time-series (remembering to save some data for testing), and then press on into the future making predictions. Figure .1 shows an example of a time-series with τ = 3 and k = 4, with the set of datapoints that make up an input vector marked as white circles, and the target coloured black.

A typical time-series problem is to predict the ozone levels into the future and see if you can detect an overall drop in the mean ozone level. You can load the data using PNoz = loadtxt('PNoz.dat') (once you've downloaded it from the website), which will load the data and stick it into an array called PNoz. There are 4 elements to each vector: the year, the day of the year, the ozone level, and the sulphur dioxide level, and there are 2855 readings. To just plot the ozone data so that you can see what it looks like, use:
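The plotting call itself was lost; a plausible reconstruction (assuming matplotlib, that the ozone level is the third column, and falling back to stand-in data so the sketch runs without the download) is:

```python
import os
import numpy as np
import matplotlib
matplotlib.use('Agg')            # non-interactive backend; use plt.show() interactively
import matplotlib.pyplot as plt

if os.path.exists('PNoz.dat'):
    PNoz = np.loadtxt('PNoz.dat')   # columns: year, day of year, ozone, SO2
else:
    # stand-in data with the same 4-column layout, so the sketch runs anywhere
    days = np.arange(2855)
    PNoz = np.column_stack((1990 + days // 365, days % 365,
                            300 + 30 * np.sin(2 * np.pi * days / 365),
                            np.zeros(2855)))

plt.plot(np.arange(np.shape(PNoz)[0]), PNoz[:, 2], '.')
plt.xlabel('Time (days)')
plt.ylabel('Ozone level')
plt.savefig('ozone.png')
```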
The first thing is to choose values of τ and k. Then it is just a question of picking k values out of the array with spacing τ, which is a good use for the slice operator, as in this code:
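A sketch of my own, using τ = 2 and k = 3 to match the worked example above (a stand-in series 1, 2, ..., 20 makes the indexing easy to check):

```python
import numpy as np

data = np.arange(1, 21)      # stand-in time-series: 1, 2, ..., 20
tau, k = 2, 3

inputs, targets = [], []
for i in range(len(data) - k * tau):
    inputs.append(data[i:i + k * tau:tau])   # k values spaced tau apart
    targets.append(data[i + k * tau])        # the next point up the series
inputs, targets = np.array(inputs), np.array(targets)

print(inputs[0], targets[0])   # [1 3 5] 7, as in the example above
```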
You then need to assemble training, testing, and validation sets.

Data Compression: The Auto-Associative Network
We are now going to consider an interesting variation of the MLP. Suppose that we train the network to reproduce the inputs at the output layer (called auto-associative learning; sometimes the network is known as an autoencoder). The network is trained so that whatever you show it at the input is reproduced at the output, which doesn't seem very useful at first, but suppose that we use a hidden layer that has fewer neurons than the input layer (see Figure .17). This bottleneck hidden layer has to represent all of the information in the input, so that it can be reproduced at the output. It therefore performs some compression of the data, representing it using fewer dimensions than were used in the input.
This auto-associative network can be used to compress images and other data. A schematic of this is shown in Figure .18: the 2D image is turned into a 1D vector of inputs by cutting the image into strips and sticking the strips into a long line. The values of this vector are the intensity (colour) values of the image, and these are the input values.
After training, we can throw away the input nodes and first set of weights of the network. If we insert some values in the hidden nodes (their activations for a particular image; see Figure .19), then by feeding these activations forward through the second set of weights, the correct image will be reproduced on the output.
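A bottleneck auto-associator can be sketched with the same machinery. This is my own illustrative code (a linear network with biases omitted, trained by gradient descent on zero-mean data that genuinely lies in two dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 four-dimensional points that actually lie on a 2-D plane
basis = rng.normal(size=(2, 4))
X = rng.normal(size=(100, 2)) @ basis

# linear auto-associator: 4 inputs -> 2 bottleneck hidden nodes -> 4 outputs
W1 = rng.normal(scale=0.1, size=(4, 2))
W2 = rng.normal(scale=0.1, size=(2, 4))
eta = 0.01
for _ in range(5000):
    H = X @ W1                    # the compressed (hidden) representation
    Y = H @ W2                    # the reconstruction at the output
    err = Y - X
    W2 -= eta * H.T @ err / len(X)
    W1 -= eta * X.T @ (err @ W2.T) / len(X)

print(np.mean(err ** 2))          # small: two hidden nodes suffice for rank-2 data
```

Because the data has only two underlying dimensions, the two-node bottleneck can represent it almost perfectly; with genuinely four-dimensional data the reconstruction error would stay larger.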
SUPPORT VECTOR MACHINES

The geometry of a support vector classifier: the circled data points are the support vectors, which are the training examples nearest to the decision boundary. The support vector machine finds the decision boundary that maximises the margin.
Figure 7.7: Support vector machine

Maximising the margin

Since we are free to rescale t, ||w|| and m, it is customary to choose m = 1. Maximising the margin then corresponds to minimising ||w|| or, more conveniently, ½||w||², provided of course that none of the training points fall inside the margin. This leads to a quadratic, constrained optimisation problem:

w*, t* = argmin_{w,t} ½||w||²  subject to  y_i(w·x_i − t) ≥ 1, 1 ≤ i ≤ n

Adding the constraints with multipliers α_i for each training example gives the Lagrange function:

Λ(w, t, α_1, ..., α_n) = ½||w||² − Σ_{i=1}^{n} α_i (y_i(w·x_i − t) − 1)
By taking the partial derivative of the Lagrange function with respect to t and setting it to 0 we find Σ_i α_i y_i = 0. Similarly, by taking the partial derivative of the Lagrange function with respect to w and setting it to 0 we obtain w = Σ_i α_i y_i x_i, the same expression as we derived for the perceptron.

For the perceptron, the instance weights α_i are non-negative integers denoting the number of times an example has been misclassified in training. For a support vector machine, the α_i are non-negative reals. What they have in common is that, if α_i = 0 for a particular example x_i, that example could be removed from the training set without affecting the learned decision boundary. In the case of support vector machines this means that α_i > 0 only for the support vectors: the training examples nearest to the decision boundary.

These expressions allow us to eliminate w and t and lead to the dual Lagrangian:

Λ(α_1, ..., α_n) = −½ Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j x_i·x_j + Σ_{i=1}^{n} α_i

The dual problem is to maximise this function under positivity constraints and one equality constraint:

α_i ≥ 0 for 1 ≤ i ≤ n,  and  Σ_{i=1}^{n} α_i y_i = 0
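The dual problem can be checked numerically on a toy example (my own, with one training point per class; scipy's SLSQP solver handles the positivity and equality constraints):

```python
import numpy as np
from scipy.optimize import minimize

# a toy two-point training set, one example per class
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# Gram matrix with the class labels folded in: entries y_i y_j x_i . x_j
Z = y[:, None] * X
G = Z @ Z.T

def neg_dual(alpha):
    # negative of the dual Lagrangian, since scipy minimises
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(2),
               bounds=[(0, None), (0, None)],
               constraints={'type': 'eq', 'fun': lambda a: a @ y})
alpha = res.x
w = (alpha * y) @ X               # w = sum_i alpha_i y_i x_i
print(alpha, w, 1 / np.linalg.norm(w))
```

Here both points are support vectors: the solver returns alpha = (1/4, 1/4), giving w = (1/2, 1/2) and a margin of 1/||w|| = √2, which matches the geometry (both points are at distance √2 from the boundary through the origin).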
Figure: Two maximum-margin classifiers. (left) A maximum-margin classifier built from three examples, with w = (0, −1/2) and margin 2. The circled examples are the support vectors: they receive non-zero Lagrange multipliers and define the decision boundary. (right) By adding a second positive example, the decision boundary is rotated to w = (3/5, −4/5) and the margin decreases to 1.

Example

The matrix X' on the right incorporates the class labels; i.e., the rows are y_i x_i. The Gram matrix is (without and with class labels):
The dual optimisation problem is thus to maximise the dual Lagrangian above subject to α_i ≥ 0 and Σ_i α_i y_i = 0. Using the equality constraint we can eliminate one of the variables and simplify the objective function. Setting the partial derivatives to 0, we obtain a set of equations for the remaining multipliers (notice that, because the objective function is quadratic, these equations are guaranteed to be linear). We therefore obtain the solution, with multiplier value 1/8. We then have w = 1/8 (x_3 − x_2), resulting in a margin of 1/||w|| = 2.

The Perceptron
The perceptron is the basic processing element. It has inputs that may come from the environment or may be the outputs of other perceptrons. Associated with each input is a connection weight, or synaptic weight, and the output, y, in the simplest case is a weighted sum of the inputs (see figure 11.1):

y = Σ_{j=1}^{d} w_j x_j + w_0
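As a sketch (the names are my own), the weighted sum is one line:

```python
import numpy as np

def perceptron_output(x, w, w0):
    # weighted sum of the inputs plus the bias weight w0
    return w @ x + w0

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
print(perceptron_output(x, w, w0=0.1))   # 0.5*1 - 0.25*2 + 0.1 = 0.1
```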