What is the back propagation technique?
Arguably the most pivotal cog in the neural network system.
(This article covers a very specific concept within neural networks. For a more general overview of the NN algorithm, please read here.)
Back propagation is not an easy concept to get a grasp of. The literature available online varies widely, even in how people understand and explain it, and matching all of those explanations up with what the concept actually means is not a trivial job. For that very reason, I have provided a simple and intuitive explanation of the concept below, along with the mathematical derivation behind it.
Goal of Back Propagation
Consider the simple neural network given below:
In a nutshell, we have a linear combination of the features going from the input layer to the hidden layer:
An activation function is then run on each of these linear combinations to get the values at the hidden layer:
We then take these hidden layer values and form a linear combination of them:
And finally, we run an activation function (sigmoid, in this example) to get the predicted probability:
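Spelled out in symbols, the forward pass for one example looks roughly like this. It is a sketch that assumes sigmoid activations at the hidden layer as well as the output (which the generic gradient form later relies on), and takes wjk(1) to mean the weight from input Xj to hidden node Hk; the exact subscript convention in the figure may differ:

```latex
% Forward pass for one training example (2 inputs, 2 hidden nodes, 1 output).
% Assumed convention: w_{jk}^{(1)} connects input X_j to hidden node H_k.
\begin{align*}
  Z_1^{(1)} &= w_{11}^{(1)} X_1 + w_{21}^{(1)} X_2 + b_1^{(1)} &
  Z_2^{(1)} &= w_{12}^{(1)} X_1 + w_{22}^{(1)} X_2 + b_2^{(1)} \\
  H_1 &= \sigma\big(Z_1^{(1)}\big) &
  H_2 &= \sigma\big(Z_2^{(1)}\big) \\
  Z_1^{(2)} &= w_1^{(2)} H_1 + w_2^{(2)} H_2 + b^{(2)} &
  p &= \sigma\big(Z_1^{(2)}\big), \quad \sigma(x) = \frac{1}{1 + e^{-x}}
\end{align*}
```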
The set-up given above runs for one training example (containing input features X1 and X2). Say you have “n” training examples; each of them would be fed through this neural network to get its own predicted probability.
But how exactly do we calculate these weight and bias terms? In the figure above, for example, there are nine of them: six between the inputs and the hidden layer, and three between the hidden layer and the output. That is, indeed, the million dollar question.
Gradient Descent
A very straightforward technique called gradient descent is used to achieve the objective here.
Let’s start with an error function for our predictions using the NN algorithm, say, E = 1/2(||p-y||)². The term “p” here represents the predicted probability for a training example, and “y” represents the true response. The squared L2 (Euclidean) norm of the difference between the two is a simple measure of how far off our predictions are, given the current “w” and “b” values that we have set:
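Summed over all “n” training examples, with Ei denoting the error of example i (the per-example terms we will differentiate shortly), this becomes:

```latex
% Total error as a sum of per-example errors E_i (single output, so the
% L2 norm reduces to a plain squared difference)
E = \sum_{i=1}^{n} E_i,
\qquad
E_i = \frac{1}{2}\,\lVert p_i - y_i \rVert^2 = \frac{1}{2}\,(p_i - y_i)^2
```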
The real question we are trying to answer is what effect a small change in one of these “w” or “b” terms has on the error function. The smaller the error, the better the prediction (duh!). We ask the same question for each of these terms, and once we have an answer for every one of them, we travel in the direction that decreases the error the fastest.
The partial derivative of “E” with respect to each of the “w” and “b” terms gives us exactly this: change just one term, and evaluate how the error changes.
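Once all these partial derivatives are known, each term is nudged a small step against its own gradient. A sketch of the update, where η is a small, user-chosen step size (a symbol introduced here, not something fixed by the method itself):

```latex
% Gradient descent update for any weight w and bias b,
% with \eta a small, user-chosen learning rate
w \leftarrow w - \eta\,\frac{\partial E}{\partial w},
\qquad
b \leftarrow b - \eta\,\frac{\partial E}{\partial b}
```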
A natural way would be to begin by changing the six coefficients between the input layer and the hidden layer, since that is what the neural network computes first. But as you’ll find out below, and as the name “backward propagation” implies, it is far more efficient to work “backwards” from the final output, “p”. The short reason this works better is that, by computing the gradients for the weights and biases closest to the output first, we can reuse those results to compute the gradients in the immediately preceding layer. In essence, we take the gradient computed at one layer and “propagate” it backwards to help calculate the gradient of the layer that came before it.
The math behind it
Let us say we want to calculate the partial derivative of the error function “E” with respect to the term w1(2), the weight that takes the hidden node H1 as its input; the weighted value runs through the final activation function and ends up at p:
The partial derivative of the error function with respect to w1(2) may be represented as:
All we now have to do is calculate these individual partial derivatives (of each Ei) and sum them up — pretty straightforward. We’ll use the chain rule in calculus to help get there.
In simple terms, the chain rule states that we can calculate the effect of changing one term on a term downstream by multiplying each gradient along the way. For example, a small change in the term w1(2) results in a small change in Z1(2), which in turn results in a small change in pi, which in turn results in a small change in Ei (refer to the basic neural network equations we mentioned earlier for clarity). The chain rule allows us to multiply these small changes to get the overall change we want to compute:
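Written out for one training example i, and using the Z1(2) notation from above, that chain of small changes is:

```latex
% Chain rule for one training example i:
% w_1^{(2)} -> Z_1^{(2)} -> p_i -> E_i
\frac{\partial E_i}{\partial w_1^{(2)}}
  = \frac{\partial E_i}{\partial p_i}
    \cdot \frac{\partial p_i}{\partial Z_1^{(2)}}
    \cdot \frac{\partial Z_1^{(2)}}{\partial w_1^{(2)}}
```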
If p = Sigmoid(x) = 1/(1+e^-x), the partial derivative of p with respect to x is p(1-p), which is a standard result. A quick derivation is below nonetheless:
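Differentiating 1/(1 + e^-x) directly:

```latex
% Derivative of the sigmoid: if p = 1/(1 + e^{-x}), then dp/dx = p(1 - p)
\frac{dp}{dx}
  = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}
  = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}}
  = p\,(1 - p)
```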
Since we are using linear combinations, the partial derivatives are thankfully easy to compute:
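Putting the three factors together for w1(2), under the sigmoid and squared-error set-up above, a sketch of the result is:

```latex
% The three factors, and the resulting per-example gradient for w_1^{(2)}
\frac{\partial E_i}{\partial p_i} = p_i - y_i,
\qquad
\frac{\partial p_i}{\partial Z_1^{(2)}} = p_i (1 - p_i),
\qquad
\frac{\partial Z_1^{(2)}}{\partial w_1^{(2)}} = H_1
\Longrightarrow\quad
\frac{\partial E_i}{\partial w_1^{(2)}} = (p_i - y_i)\, p_i (1 - p_i)\, H_1
```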
Let’s keep this result in our back pocket for now and compute the gradient for a term in the preceding layer, say w21(1):
On taking the individual partial derivatives, we observe that a familiar term, the one we computed just a minute ago, now appears at the end of the formulation:
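Under the same assumed notation (w21(1) taken as the weight from input X2 to hidden node H1; the figure’s convention may differ), the expansion would look roughly like this:

```latex
% Chain rule for w_{21}^{(1)}:
% w_{21}^{(1)} -> Z_1^{(1)} -> H_1 -> Z_1^{(2)} -> p_i -> E_i
\frac{\partial E_i}{\partial w_{21}^{(1)}}
  = \frac{\partial Z_1^{(1)}}{\partial w_{21}^{(1)}}
    \cdot \frac{\partial H_1}{\partial Z_1^{(1)}}
    \cdot \frac{\partial Z_1^{(2)}}{\partial H_1}
    \cdot \frac{\partial p_i}{\partial Z_1^{(2)}}
    \cdot \frac{\partial E_i}{\partial p_i}
  = X_2 \cdot H_1 (1 - H_1) \cdot w_1^{(2)} \cdot p_i (1 - p_i) \cdot (p_i - y_i)
```

The last two factors, pi(1 - pi) and (pi - yi), are exactly the familiar piece from the previous calculation.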
Since we are “propagating” backwards and storing these partial answers along the way, they prove useful in computing the partial derivatives of the preceding layer, which is the essence of back propagation. If you look closely, you will notice a pattern to these gradient values. The gradient for any given weight can be calculated using a generic form:
Gradient = (‘From’ node) * (‘To’ node) * (1 - ‘To’ node) * delta(‘To’ node)
where “From” and “To” denote the nodes the weight originates from and ends at.
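Matching this generic form against the two gradients derived above (with the same assumed connection for w21(1)), the pieces line up as follows; the delta term can be read as the error signal stored for the ‘To’ node as we move backwards:

```latex
% The generic form, applied to the two gradients derived above.
% delta('To' node) is the error signal stored for that node.
\frac{\partial E_i}{\partial w_1^{(2)}}
  = H_1 \cdot p_i (1 - p_i) \cdot \delta(p_i),
  \qquad \delta(p_i) = (p_i - y_i)
\frac{\partial E_i}{\partial w_{21}^{(1)}}
  = X_2 \cdot H_1 (1 - H_1) \cdot \delta(H_1),
  \qquad \delta(H_1) = w_1^{(2)}\, p_i (1 - p_i)\, (p_i - y_i)
```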
Process
In the context of solving for these weights, a summary of the entire process is given below, followed by a short code sketch of the full loop:
Step 1: Randomly initialise all the weights and biases to, say, a value ranging from 0 to 1.
Step 2: Calculate all relevant partial derivatives using the back propagation technique.
Step 3: Move in the direction of steepest descent, i.e. take a gradient descent step in the direction that causes the error function to decrease the fastest.
Step 4: Keep recalculating the gradients and updating the weights and biases until a stopping condition is met, to obtain the final weight and bias values.
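To make these steps concrete, here is a minimal NumPy sketch of the whole loop for the 2-2-1 network above, on a made-up toy dataset. The data, learning rate, and iteration count are arbitrary choices for illustration; it follows the sigmoid/squared-error set-up used throughout, and averages the gradients over the examples to keep the step size stable.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: n examples with 2 input features each, and a binary response.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
n = len(X)

# Step 1: randomly initialise all weights and biases in [0, 1).
W1 = rng.uniform(0, 1, size=(2, 2))   # input -> hidden (column k feeds hidden node k)
b1 = rng.uniform(0, 1, size=2)
W2 = rng.uniform(0, 1, size=2)        # hidden -> output
b2 = rng.uniform(0, 1)

eta = 1.0                             # learning rate (arbitrary)
for step in range(5000):              # Step 4: repeat until a stopping condition is met
    # Forward pass (the equations from the start of the article).
    Z1 = X @ W1 + b1                  # linear combinations into the hidden layer
    H = sigmoid(Z1)                   # hidden node values H1, H2
    Z2 = H @ W2 + b2                  # linear combination into the output
    p = sigmoid(Z2)                   # predicted probabilities

    # Step 2: back propagation. The delta terms are the stored partial answers.
    delta_out = (p - y) * p * (1 - p)                  # dEi/dZ1(2)
    delta_hid = np.outer(delta_out, W2) * H * (1 - H)  # dEi/dZ(1), one column per hidden node

    # Gradients: 'From' node values times the deltas (which already carry the
    # To * (1 - To) factor), averaged over the n examples.
    grad_W2 = H.T @ delta_out / n
    grad_b2 = delta_out.mean()
    grad_W1 = X.T @ delta_hid / n
    grad_b1 = delta_hid.mean(axis=0)

    # Step 3: move in the direction of steepest descent.
    W2 -= eta * grad_W2
    b2 -= eta * grad_b2
    W1 -= eta * grad_W1
    b1 -= eta * grad_b1

print("final error:", 0.5 * np.sum((p - y) ** 2))
```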
Conclusion
Even if the math looks confusing at first, an easy pattern emerges in the calculation of every gradient. The pattern arises solely because we are propagating these derivatives backwards, using the results we obtain at a given layer to compute the gradients of the layer that precedes it.