Sunday, July 16, 2017

How Dropout Works in a Neural Network

Overview

Dropout is one of the standard techniques for building a good neural network model. The mechanism behind it is very simple.



What is Dropout?

Dropout is a technique to prevent a neural network from over-fitting. During training, it deactivates a certain rate of nodes, which keeps the model from fitting the training data too closely.
A neural network usually has enormous capacity to trace the data, which means it can easily run into over-fitting. So techniques like regularization and dropout play an important role in dealing with that.
The diagram below gives you an image of how dropout works.



The black nodes are the deactivated ones. During training, the parameters are updated while some nodes are deactivated.
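As a minimal sketch of this masking (assuming NumPy; the keep probability and the layer size here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.5                    # probability that each node stays active
y = rng.normal(size=8)     # activations of one layer (8 nodes)

# one Bernoulli(p) draw per node: 1 = keep, 0 = deactivate
mask = rng.binomial(1, p, size=y.shape)

print(mask)      # which nodes survived this pass, e.g. [1 0 1 1 0 0 1 1]
print(y * mask)  # deactivated nodes contribute 0 to the next layer
```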

A closer look

Dropout gives us the following two effects:
  • easing over-fitting
  • improving accuracy through model combination

Let's consider the neural network below. It has an input layer, a hidden layer, and an output layer.



If you apply dropout to the hidden layer, you can express it as in the diagram below. The nodes labeled $r$ represent the dropout masks. As you can see, each node in that layer has only a certain probability of being activated.



To treat it mathematically, let's look at a more precise diagram.



Note that this does not mean a fixed fraction of the nodes is activated; rather, each node is activated independently with the specified probability.
In equations, you can express it as below, where $l$ indexes the layer, $y^{(l)}$ is the output of layer $l$, $\mathbf{w}_i^{(l+1)}$ and $b_i^{(l+1)}$ are the weights and bias of node $i$ in the next layer, and $f$ is the activation function:

$$r_j^{(l)} \sim \mathrm{Bernoulli}(p)$$

$$\tilde{y}^{(l)} = r^{(l)} * y^{(l)}$$

$$z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}$$

$$y_i^{(l+1)} = f\left(z_i^{(l+1)}\right)$$
The $r_j^{(l)}$ express the dropout. One is attached to each node, and each follows a Bernoulli distribution: its value becomes $1$ with probability $p$ and $0$ with probability $1-p$.
In this case, $y_j^{(l)}$ becomes the input of the next nodes as it is with probability $p$, and with probability $1-p$ it gives $0$ to the next nodes.
At training time, the parameters are updated with some nodes activated and the rest deactivated. At test time, all nodes are used and the learned weights are multiplied by $p$, so that the expected input to each node matches what it received during training.
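Here is a minimal sketch of that train/test asymmetry for a single dense layer, following the paper's scheme (mask at training time, weights scaled by $p$ at test time); the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.5                          # keep probability

y = rng.normal(size=4)           # y^(l): output of the previous layer
W = rng.normal(size=(3, 4))      # w^(l+1): weights of the next layer
b = np.zeros(3)                  # b^(l+1): biases of the next layer

# training: sample r ~ Bernoulli(p) per node and mask the inputs
r = rng.binomial(1, p, size=y.shape)
z_train = W @ (r * y) + b

# test: keep every node, but multiply the weights by p so the
# expected pre-activation matches the training-time one
z_test = (p * W) @ y + b

print(z_train)
print(z_test)
```

Modern frameworks usually implement "inverted dropout" instead: they divide the masked activations by $p$ at training time and leave the weights untouched at test time. The two schemes are equivalent in expectation.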

"Model combination" means that at each training step, the randomly activated nodes form a different thinned network, so training effectively combines many models. When the architecture has $n$ nodes, $2^n$ such networks can appear.
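To see the counting concretely: each node can independently be kept or dropped, so $n$ nodes give $2^n$ possible masks. A tiny sketch enumerating them for $n = 3$:

```python
from itertools import product

n = 3
masks = list(product([0, 1], repeat=n))  # every keep/drop pattern over n nodes

print(len(masks))  # 2**3 = 8 thinned sub-networks
print(masks)       # (0, 0, 0), (0, 0, 1), ..., (1, 1, 1)
```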

Adjusting the dropout parameter

Dropout has one parameter, $p$. To choose $p$, two types of settings were tried in the paper:
  1. adjust $p$ with the number of nodes $n$ in a given layer fixed
  2. adjust $p$ and the number of nodes together, keeping the product $pn$ fixed
The first one means you change the rate of activated nodes and check accuracy and loss. The second one means you change $p$ while keeping the expected number of activated nodes fixed, and check accuracy and loss.
In practice, if the number of activated nodes is too small, training doesn't go well, and if the rate of activated nodes is too high, over-fitting is easily observed.
So it is efficient to tune the parameter from the viewpoint of those two settings.
As a rough guideline, around 80% of the nodes should stay active just after the input, and on hidden layers about 50% is favorable.
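Below is a rough, runnable sketch of the first setting (fixed layer size, sweeping the keep rate $p$). It assumes TensorFlow's Keras and uses a toy synthetic dataset invented purely for illustration; note that Keras's Dropout takes the drop rate, so we pass 1 - p:

```python
import numpy as np
from tensorflow import keras

# toy data only so the sketch runs; substitute your real dataset
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 20)).astype("float32")
t = (x.sum(axis=1) > 0).astype("float32")

# setting 1: fix the hidden layer size, vary the keep rate p
for p in [0.4, 0.5, 0.6, 0.7, 0.8]:
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dropout(1 - p),   # Keras wants the DROP rate
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    hist = model.fit(x, t, epochs=5, validation_split=0.2, verbose=0)
    print(f"p={p:.1f}  val_acc={hist.history['val_accuracy'][-1]:.3f}")
```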

Personally, these are the points I paid attention to:
  • the parameter is applied per node, not to the layer as a whole
  • in the paper, $p$ means the rate of activated nodes, but in Keras, the rate argument means the rate of deactivated nodes (see the sketch below)
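A quick check of that Keras convention (assuming TensorFlow's Keras): rate=0.5 drops roughly half of the entries, and the survivors are scaled by 1/(1 - rate) because Keras uses inverted dropout:

```python
import numpy as np
from tensorflow import keras

drop = keras.layers.Dropout(0.5)      # 0.5 = fraction of nodes DROPPED
x = np.ones((1, 10), dtype="float32")

print(drop(x, training=True))   # roughly half zeros, survivors scaled to 2.0
print(drop(x, training=False))  # identity: dropout is off at test time
```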

Reference

Geoffrey E. Hinton's page has a link to the paper about dropout.
The paper describes experiments on many types of data.