
Exercises: Miniproject Assistant: Thomas Stroesslin 
The aim of this miniproject is to give you a feel for a realistic application of artificial neural networks.
As always in supervised learning, you start from a database
{ (x^{mu}, t^{mu}); mu = 1, ..., p }.
You will use a neural network to predict the output values t^{mu} from the inputs x^{mu}.
The project consists of six steps, which are described below.
An overview of the databases is here.
The general philosophy of the DELVE database is described here.
For the Miniproject, choose one of the following four tasks, for which DELVE provides databases. (For each database, look at the 'detailed documentation' to find out exactly what is to be done.)
Use your code for BackProp/Multilayer perceptron with momentum that you have developed so far.
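As a reminder, the momentum update keeps a running velocity built from past gradients and adds it to the weights at each step. A minimal sketch (the function name and the toy quadratic objective are just for illustration, not part of the exercise):

```python
def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    """One gradient-descent step with momentum: the velocity is an
    exponentially decaying accumulation of past gradients
    (alpha = momentum coefficient, eta = learning rate)."""
    velocity = alpha * velocity - eta * grad
    return w + velocity, velocity

# toy usage: minimise f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, 2 * w, v)
```

In your network the same two lines are applied to every weight (and threshold) after backpropagating the error.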
If you want, you can look at the C program available at http://lslwww.epfl.ch/~aperez/NN_tutorial/Bkprop.c or at the program bpsim at http://diwww.epfl.ch/mantra/tutorial/english/mlpc/index.html. You may use any programming language you want.
In your program you should compare two out of the following three methods of regularisation:
The penalty term for weight decay is:
penalty = lambda * Sum_{i} (w_{i}^{2})
where the sum runs over all weights in all layers, but not over the thresholds. (Remember that the thresholds are treated as additional weights attached to a bias unit with constant activity x_{i} = 1; in the learning algorithm they are updated like any other weight, but they are not included in the sum of the penalty term.)
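In code, the weight-decay penalty and its gradient contribution might look as follows. This is a sketch that assumes each layer's weight matrix stores the threshold (bias) row last; adapt the indexing to your own layout:

```python
import numpy as np

def weight_decay_penalty(weight_matrices, lam):
    """Penalty lambda * sum_i w_i^2 over all layers, excluding the
    threshold (bias) row, assumed here to be stored last in each matrix."""
    return lam * sum(np.sum(W[:-1] ** 2) for W in weight_matrices)

def weight_decay_gradient(W, lam):
    """Contribution of the penalty to dE/dW for one layer;
    zero for the bias row, since thresholds are not penalised."""
    g = 2.0 * lam * W
    g[-1] = 0.0  # thresholds excluded from the penalty
    return g
```

During learning, `weight_decay_gradient` is simply added to the backpropagated error gradient of each layer.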
Similarly the term for weight elimination is
penalty = lambda * Sum_{i} ( w_{i}^{2} / (c + w_{i}^{2}) )
Again, the sum does not include the thresholds. Take for example c=N where N is the number of weights.
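In code, the weight-elimination penalty might look like this (again a sketch assuming the bias row is stored last in each weight matrix and excluded from the sum):

```python
import numpy as np

def weight_elimination_penalty(weight_matrices, lam, c):
    """Penalty lambda * sum_i w_i^2 / (c + w_i^2).

    Small weights cost roughly w^2 / c (like weight decay), while
    large weights saturate near a cost of lambda each, so the term
    effectively counts large weights. Bias rows (stored last) are
    excluded, as required in the exercise."""
    total = 0.0
    for W in weight_matrices:
        w = W[:-1]
        total += np.sum(w ** 2 / (c + w ** 2))
    return lam * total
```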
Don't touch the data in the third group during learning. It is reserved for the final performance measure.
In the DELVE database, results are reported as a function of the number of data points you used for groups 1) and 2) together. For example, 200 samples for 1) + 2) and the rest for error measures, or 512 samples for 1) and 2) and the rest for error measures. Check on the results page for your database which numbers you should use. To find this out, click in the column 'view results' of the DELVE summary table.
Suppose you take a total of 512 data points for learning and testing (groups 1 and 2). Use the method of cross-validation in order to optimize lambda. This means: for each value of lambda, split the data points randomly into two groups 1) and 2) of comparable size, run BackProp, and record the learning and validation errors. Repeat about ten times with different splits and take the mean. (Thus, for each value of lambda, you make several complete learning trials.)
Now, if your database is rather small (let us assume 128 data samples for training and validation), you might want to use the systematic leave-one-out cross-validation technique to get better results: take each sample in turn as the validation set, then average your 128 error measurements.
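The random-split cross-validation procedure can be sketched as follows. Here, purely so the example is self-contained, closed-form ridge regression stands in for your backprop training; `fit_ridge`, `cross_validate`, and the toy data are illustrative, not part of the exercise:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ridge(X, t, lam):
    """Stand-in for your backprop training: linear regression with
    weight-decay strength lam, solved in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

def mse(X, t, w):
    return np.mean((X @ w - t) ** 2)

def cross_validate(X, t, lambdas, n_splits=10):
    """For each lambda, average the validation error over n_splits
    random half/half splits, then return the best lambda."""
    n = len(t)
    scores = []
    for lam in lambdas:
        errs = []
        for _ in range(n_splits):
            perm = rng.permutation(n)
            tr, va = perm[: n // 2], perm[n // 2 :]
            w = fit_ridge(X[tr], t[tr], lam)
            errs.append(mse(X[va], t[va], w))
        scores.append(np.mean(errs))
    return lambdas[int(np.argmin(scores))]

# toy data: 512 points, noisy linear target
X = rng.normal(size=(512, 5))
t = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=512)
best_lambda = cross_validate(X, t, [0.0, 0.01, 0.1, 1.0, 10.0])
```

For leave-one-out cross-validation, replace the random half/half split by 128 splits that each hold out exactly one sample.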
TASK: plot the learning error and test error as a function of lambda.
In case of early stopping:
TASK: plot the learning error and test error as a function of the learning time.
Once you have found the optimal lambda, retrain your network with this value of lambda. You should restart the training process several times and retain the best solution.
Proceed similarly for early stopping: make several repetitions and retain the best solution.
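An early-stopping run can be sketched generically as follows; `train_epoch` and `validation_error` are hypothetical stand-ins for your own training and evaluation code:

```python
def early_stopping(train_epoch, validation_error,
                   max_epochs=1000, patience=20):
    """Generic early-stopping loop.

    train_epoch() performs one training epoch and returns the current
    weights; validation_error(w) evaluates them on the validation set.
    Training stops once the validation error has not improved for
    `patience` consecutive epochs; the best weights seen are returned."""
    best_w, best_err, wait = None, float("inf"), 0
    for epoch in range(max_epochs):
        w = train_epoch()
        err = validation_error(w)
        if err < best_err:
            best_w, best_err, wait = w, err, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_w, best_err
```

Repeat this whole loop from several random initializations and keep the run with the lowest validation error.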
At the very end, use the resulting network in forward mode on part 3) of the data base. This gives you the final prediction error. (Note: once you touch part 3) you are no longer allowed to change parameters or 'improve' the network.)
TASK: Report the performance error of your network and compare it with the performance errors of other methods as reported in the DELVE database.
Click in the column 'view results' of the DELVE summary.
DELVE uses the following formulas for the final error measurements. Prediction tasks: E = (Sum_{mu} [t^{mu} - x^{out}(mu)]^{2}) / (Sum_{mu} [t^{mu} - <t>]^{2}), mu = 1, ..., p.
Classification tasks: E = (Sum_{mu} [t^{mu} != sgn(x^{out}(mu))]) / (Sum_{mu} [t^{mu} != t_{max}]), where t_{max} is the biggest class.
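A sketch of both standardised error measures, assuming +/-1 targets for the classification case (function names are illustrative):

```python
import numpy as np

def standardized_squared_error(t, y):
    """Prediction tasks: sum of squared errors divided by the squared
    error of always predicting the mean target, so that E = 1.0
    corresponds to the trivial mean predictor."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    return np.sum((t - y) ** 2) / np.sum((t - t.mean()) ** 2)

def standardized_classification_error(t, y, t_max):
    """Classification tasks: misclassification count of sgn(y) divided
    by the error of always guessing the majority class t_max
    (assumes +/-1 targets)."""
    t, y = np.asarray(t), np.asarray(y)
    return np.sum(t != np.sign(y)) / np.sum(t != t_max)
```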
For more information about DELVE's error measurements, see chapter 8 of the Delve User Manual
TASK: Compare the performance of the two regularization methods that you have chosen. Which one is better for your data set?
If you have time, you may also think about whether the difference is significant. How to measure significance is indicated here.
In the report you should state:
Instead of writing a report you may also present your results in a short seminar talk.