Generalization in Multi-Layer Perceptrons 

Introduction

This applet illustrates the generalization capabilities of multi-layer perceptrons. It lets you define two different sets of data points: one for training and one for cross-validation. The two sets are necessary to study generalization in a systematic manner.

Credits

The original applet was written by Olivier Michel. It was modified by Alix Herrmann and Angelo Arleo to operate in [-1,1].

Instructions

Use the popup menu to choose whether new points are added to the training set or to the cross-validation set. The graph displays the error on the training set in black and the error on the cross-validation set in green.
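The applet's own code is not reproduced here, but as a rough illustration of what it computes, here is a minimal sketch (assuming Python with numpy; the network size, learning rate, and all names are illustrative choices, not the applet's) of a small tanh network trained by backprop on 2-D points labelled +1 (red) or -1 (blue), recording the error on a separate cross-validation set at every iteration:

    import numpy as np

    rng = np.random.default_rng(0)

    def init(n_in=2, n_hidden=4):
        # Small random weights for one hidden layer of tanh units
        return {"W1": rng.normal(0, 0.5, (n_hidden, n_in)),
                "b1": np.zeros(n_hidden),
                "w2": rng.normal(0, 0.5, n_hidden),
                "b2": 0.0}

    def forward(p, X):
        H = np.tanh(X @ p["W1"].T + p["b1"])     # hidden activations
        y = np.tanh(H @ p["w2"] + p["b2"])       # output in [-1, 1]
        return H, y

    def mse(p, X, t):
        _, y = forward(p, X)
        return np.mean((y - t) ** 2)

    def train(p, Xtr, ttr, Xcv, tcv, eta=0.1, iters=100):
        history = []                             # (training error, cross-validation error)
        for _ in range(iters):
            H, y = forward(p, Xtr)
            # Backprop for the mean squared error (1/N) * sum (y - t)^2
            d_out = 2 * (y - ttr) * (1 - y ** 2) / len(ttr)
            d_hid = np.outer(d_out, p["w2"]) * (1 - H ** 2)
            p["w2"] -= eta * H.T @ d_out
            p["b2"] -= eta * d_out.sum()
            p["W1"] -= eta * d_hid.T @ Xtr
            p["b1"] -= eta * d_hid.sum(axis=0)
            history.append((mse(p, Xtr, ttr), mse(p, Xcv, tcv)))
        return history

    # Example: two well-separated clusters, as in question 1 below
    Xtr = np.array([[-2., 0.], [-2., 1.], [ 2., 0.], [ 2., 1.]])
    ttr = np.array([-1., -1., 1., 1.])
    Xcv = np.array([[-2., .5], [ 2., .5]])
    tcv = np.array([-1., 1.])
    history = train(init(), Xtr, ttr, Xcv, tcv)

Plotting the two columns of the returned history against the iteration number gives curves analogous to the black and green graphs of the applet.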

Applet


Questions

For all questions except the last, leave the decay parameter at zero.
  1. Easy problem: Create two simple clusters of training points, a red one (+1's) and a blue one (-1's), that are linearly separable and well separated. Then add cross-validation points inside each cluster. To be realistic, the cross-validation points should have the same color as the training points of the cluster they belong to. Run learning for about 100 iterations and observe the resulting error graphs. Can you comment on both errors?
  2. More complicated problem: Use two similar simple clusters, but place some cross-validation points slightly outside the training clusters. Do you observe any change in the error graphs? Why?
  3. Hard problem: Now create two linearly separable clusters that lie very close to each other. Create cross-validation points and put some of them slightly outside the clusters, or even inside the other cluster. Run the learning and comment on the results. Does the cross-validation error graph reach a minimum and then rise again? How would you explain this? (A small sketch for locating this minimum follows the question list.)
  4. Non-linearly separable problem: Set 3 blue training points on the left-hand side of the space, 6 red training points in the middle, and 3 blue training points on the right-hand side. Add 3 cross-validation points to the first group, 6 to the second, and 3 to the last one. Change the number of hidden units and the learning parameters if necessary to obtain convergence to zero error on the training set. Do you observe an error graph similar to the one in question 1? Why?
  5. Getting more and more complicated: Try to solve more complicated problems (e.g., similar to questions 2 and 3) with non-linearly separable clusters.
  6. General questions: How would you characterize the evolution of the error on a cross-validation set? How should a training set be designed in order to get the best results?
  7. Weight elimination algorithm: As discussed in class, the decay parameter controls an extra term in the weight update step. Set the decay parameter to a small value such as 0.001 and use several (at least 4) units in the hidden layers. (Don't forget to click Init each time you change any of the network parameters.) Compare the training results with standard backprop (decay = 0.0). (A sketch of how such a decay term enters the update follows below.)
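For question 3, a minimal way to locate the point where overfitting begins is to look for the iteration with the lowest cross-validation error. Here is a small sketch, assuming a list of (training error, cross-validation error) pairs such as the history returned by the training sketch above:

    def best_stopping_point(history):
        # Index of the iteration with the lowest cross-validation error;
        # training past this point usually keeps lowering the training error
        # while the cross-validation error rises again (overfitting).
        return min(range(len(history)), key=lambda i: history[i][1])

    # Made-up error values for illustration: the minimum is at iteration 2.
    print(best_stopping_point([(0.9, 0.9), (0.5, 0.6), (0.3, 0.5), (0.2, 0.7)]))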
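For question 7, the sketch below shows one common way a decay parameter enters the weight update. It uses plain quadratic weight decay for illustration, which is a simplification and may differ from the exact weight-elimination penalty used by the applet; the names update, eta, and decay are illustrative assumptions:

    import numpy as np

    def update(w, grad, eta=0.1, decay=0.001):
        # Besides following the negative gradient, every weight is shrunk
        # towards zero by an amount proportional to the decay parameter
        # (plain weight decay; the applet's exact penalty may differ).
        return w - eta * grad - eta * decay * w

    # With decay = 0.0 this reduces to the standard backprop step.
    print(update(np.array([0.8, -1.2]), grad=np.array([0.05, -0.02])))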